PREFACE


This  Memorandum  was  prepared  for  the  Advanced  Research 
Projects  Agency  (ARPA)  as  part  of  RAND's  continued  interest 
in  the  problems  of  data  processing  for  ballistic  missile 
defense.

The  original  intent  was  to  write  specifically  on  the 
topic  of  computer  reliability  for  missile  defense,  but  the 
scope increased to cover the broader field of the operational
availability of data processing systems, and in
particular,  those  ground-based  systems  which  can  be  repaired. 

Selected  portions  of  this  report  should  be  of  special 
interest to statisticians, circuit designers, and programmers;
and the conclusions should interest operations and
systems  analysts  and  others  who  are  concerned  with  computer 
reliability. 

This  work  should  not  be  construed  as  a  handbook  on 
data  processor  reliability;  the  treatment  of  the  topics 
is  not  uniform,  the  authors  having  been  guided  to  a 
large  extent  by  their  own  opinions  and  interests.  We 
ask,  therefore,  the  tolerance  of  any  specialist  whose 
particular  area  of  interest  may  appear  slighted.  Space  does 
not  permit  completeness  in  all  the  areas  which  contribute 
to  a  reliable  computer. 

One  of  the  authors,  Rodger  Lowe,  is  Vice  President  of 
the  Mesa  Scientific  Corporation  of  Inglewood,  California 
and  a  consultant  to  The  RAND  Corporation. 
SUMMARY


The  electronic  computer  plays  a  steadily  increasing 
role  in  the  affairs  of  man;  and  man,  in  turn,  relies  more 
and  more  on  the  voice  of  the  computer.  As  this  trend 
continues, the computer grows in size and cost, and reliability
becomes not just a desirable feature, but an
absolute necessity. The effects of a computer failing
during a time of national emergency or when the control
and safety of an entire steel mill are at stake can be
catastrophic.

This  Memorandum  discusses  the  many  aspects  (both 
qualitative  and  quantitative)  of  obtaining  a  reliable 
digital  computer  and,  in  particular,  investigates  that 
class  of  ground-based  data  processing  systems  where  repair 
is  possible. 

The study begins by reviewing the reliability of computer
parts (transistors, capacitors, integrated circuits)
and applies the results to a large variety of probabilistic
models of system availability. Further, it discusses the
availability of ground-based data processing systems--
specifically, the probability that a repairable computer
which should be ready will in fact be ready for use at
some arbitrary future time. It is concluded that part
failure distributions show a form of decreasing failure
rates for the entire population which in no way correlates
with the predicted behavior of the ideal part. The total
part population shows a decreasing failure rate because,
and only because, various controllably small subgroups
show initially increasing failure rates until every member
of the subgroup has failed, at which time the failure rate
of the entire population effectively decreases.

Next,  the  authors  survey  the  various  machine  structures 
which  yield  higher  reliability  and  show  that,  where  service 
is  available,  redundancy  is  never  a  contender  as  a  means  to 
high reliability. With service, the multi-processor provides
the highest availability, the multiple-processor
(duplex or triplex), second highest.

A  prediction  of  the  failure  rates  for  the  best  parts 
available  in  1968  yields  values  which  are  roughly  about  a 
factor  of  three  smaller  than  the  best  1965  values,  and  a 
factor  of  ten  smaller  than  what  are  considered  "good"  1965 
parts.  Significant  improvements  in  machine  availability 
will  be  realized  as  a  result  of  this  decrease  in  part 
failure  rate. 

Computer  availability  can  also  be  improved  by  decreasing 
the  time  spent  in  repairing  a  faulty  machine.  A  survey  of 
developments  in  automatic  fault  diagnosis  and  isolation 
reveals  that  systems  of  considerable  elegance  and  power  are 
now  available,  though  expensive.  Their  employment  would 
considerably  reduce  service  time. 

Chapter VII gives a much fuller summary of this Memorandum.
There, also, the reader will find references into
the  text  for  explicit  recommendations  on  improving  the 
reliability  of  the  computer  and  its  associated  programs. 
processor).

m       2m+1 = the order of redundancy, m = 0, 1, 2, ... (redundant
        computer).

        The number of single computers in the system
        (multi-processor).

N       The total number of parts (assumed to be identical)
        in the entire system.

P(t)    The availability; the probability that the system
        is available (on) at time t.

P_∞     The asymptotic value of P(t), lim P(t) as t → ∞.

r(t)    The failure rate, Pr[t ≤ T_f ≤ t+dt | T_f ≥ t].

T_f     The time to first failure.

T_s     The time to service.

α       A parameter of the Weibull distribution.

Δ       A small time increment.

η       MA/N

λ       The constant failure rate of a part whose failure
        distribution is exponential.

μ       The constant service rate of a service process
        whose distribution is exponential.
Chapter I
INTRODUCTION 


This  study  began  with  a  question:  "Are  contemporary 
computers  reliable  enough  to  be  used  in  future  ballistic 
missile  defense  (BMD)  systems?"  Several  reasons  suggest 
they might not be. Two major ones are: 1) the tremendous
peak processing load for urban defense requires a computing
complex of awesome size,* and 2) in the case of hard-point
defense (e.g., hardened missile sites) the problem
of  servicing  the  computer  in  remote  areas,  even  if  its 
size  were  not  comparable  to  the  urban  behemoth,  still 
raises  a  serious  doubt  as  to  whether  the  data  processor 
would  be  available  for  use  if  ever  it  were  called  upon. 

It  was  subsequently  decided  to  expand  the  question 
to include the entire subject of digital computer reliability,
and to emphasize those large, ground-based computers
where  service  is  possible.  The  size  of  the  processor  and 
the  nature  of  the  service  enter  as  parameters  of  a  larger 
model.  By  providing  estimates  for  the  values  of  these 
parameters  (i.e.,  by  guessing  the  size  of  the  machine  and 
how  to  fix  it  when  it  breaks)  the  authors  prevent  the 
special  problem  (BMD  data  processing)  from  suffering  by 
the  generalization  of  the  problem. 

*For instance, the Univac development for Nike-X
announced in Aviation Week, November 30, 1964.

This decision has not been without dividends to the
authors. It relieved them of the chore of actually estimating
the size and logistic properties of any proposed
BMD system, although this has partially been done for the
hard-point defense case [1]. It does mean that the reader
who  wants  to  apply  the  results  to  a  specific  system  must 
be  able  to  say,  for  instance,  "I've  got  about  a  50,000 
transistor  computer  made  out  of  such-and-such  quality 
parts  and  I'll  probably  spend  an  hour  repairing  it  when 
it's  down."  There  is  much  more  to  the  story,  but  essentially 
the  user  (not  so  much  the  manufacturer)  must  have  some  idea 
of  what  is  required  to  solve  his  computing  problem  and  how 
he  intends  to  provide  the  necessary  service. 

The  reader  should  also  know  what  level  of  reliability 
he  desires.  Whether  the  probability  of  the  system  being 
available  (see  p.  3)  should  be  .90  or  .99990  is  left  to  the 
decision-maker,  and  the  availability,  as  it  is  presented 
here,  is  only  a  measure  which  is  functionally  related  to 
the  many  parameters  of  the  system  with  no  subjective  value 
placed  on  it. 

Some  mention  should  be  made  at  the  outset  about  the 
use  of  certain  terms.  The  word  "reliability"  appears  in 
the  title  and  continually  in  the  text.  Presently,  a  strict 
definition  of  reliability,  along  with  other  words,  will 
appear,  but  it  is  not  the  authors'  intent  to  aid  in  the 
proliferation  of  precise  terms  with  which  the  reliability 
field  already  abounds.  To  this  end,  the  word  reliability 
has  the  usual  colloquial  meaning:  being  just  a  measure  of 
whether  "it  works  or  doesn't." 

One  bias  in  viewpoint  remains  from  the  BMD  beginnings. 
Namely,  most  of  the  effort  in  probabilistic  analysis  has 
been  to  ascertain  if  the  computing  system  is  on  when  it 
is  needed.  This  differs  considerably  from  asking,  say, 
about the fraction of all time that the computer is on.
In other words, the assumption is usually made that the
mission time is very much less than the lifetime of the
system, and that the probability of successfully completing
the mission given that the system is on at the start
is almost unity.

The reader whose particular application won't permit
this assumption may, as a rule, answer the question,
"What is the probability that the system is on at time t_1
and remains on until t_2?" by first finding the probability
of being on at t_1, then multiplying by the probability
that no failure occurs in the interval (t_1, t_2). This can
be done for the majority of the examples given here because
the relevant stochastic processes are Markov processes,†
and in all cases the pertinent probability density function
or distribution function will be given.
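
As a rough illustration of the rule above, the following sketch assumes the simplest Markov case: a single repairable machine with a constant failure rate and a constant service rate (the exponential rates of the symbol list). The closed form used for P(t) is the standard two-state result, not a formula quoted from this Memorandum, and the numerical rates are invented for the example.

```python
import math

def availability(t, lam, mu, on_at_zero=True):
    """Pointwise availability P(t) of a two-state (on/off) Markov model.

    lam: constant failure rate per hour (assumed value supplied by the user)
    mu:  constant service rate per hour (assumed value supplied by the user)
    """
    p_inf = mu / (lam + mu)                  # asymptotic availability
    p0 = 1.0 if on_at_zero else 0.0
    return p_inf + (p0 - p_inf) * math.exp(-(lam + mu) * t)

def on_throughout(t1, t2, lam, mu):
    """Probability of being on at t1 and staying on, without repair, until t2."""
    no_failure = math.exp(-lam * (t2 - t1))  # exponential survival over (t1, t2)
    return availability(t1, lam, mu) * no_failure

# Hypothetical rates: one failure per 100 hr, one-hour mean repair time.
LAM, MU = 0.01, 1.0
print(availability(1000.0, LAM, MU))
print(on_throughout(1000.0, 1001.0, LAM, MU))
```

The product form is valid here only because the exponential failure law is memoryless; for other failure distributions the second factor would depend on how long the machine had already been running.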

1.  DEFINITIONS 

Only  three  definitions  are  required  for  the  work 
which  follows. 

o  Reliability.  The  probability  that  a  device  will 
perform  its  purpose  adequately  for  the  period  of 
time  intended  under  the  operating  conditions 
encountered.*


†A Markov process has the property that the future
of  the  process  is  only  dependent  on  its  present  state  and 
not  on  the  time  history  of  the  process  up  to  the  present. 

*Radio-Electronics-Television  Manufacturers  Associa¬ 
tion,  1955. 


o Availability. The probability that the system can
operate within the tolerances at a given instant
of time.†

o  Interval  reliability.  The  probability  that  at  a 
specified  time,  the  system  is  operating  and  will 
continue to operate for a duration, say x. This
continued operation during the interval is, of
course, to be achieved without benefit of repair
or replacement [3].

Many  more  definitions  might  be  useful,  particularly 
those  related  to  the  fraction  of  time  the  machine  operates. 
Most of these may be found in [2].


2.  THE  SCOPE  OF  THE  REPORT 

Below  is  an  annotated  guide  to  the  main  topics  of 
this  report. 

Theoretical Performance. This is, in most cases, a computation
and evaluation of the availability of a system.
Knowledge of service and part failure distributions is
assumed (and thus the word "theoretical"). The resulting
systems distributions are computed and from these, the
availability, P(t), is computed. In many cases, the
asymptotic availability, P_∞ = lim P(t) as t → ∞, is used
instead of P(t). The background and derivation are given in
Appendix A instead of in the main text, and certain
selected results are used in Chapters III and IV. The
central problem is: Given parts of specified reliability,
how should a computer be constructed to attain a desired
system reliability? Subsequent chapters discuss achieving
this reliability through redundant techniques and multiple
computers.

†Hosford [2]. This is more formally known as "pointwise
availability," and sometimes the "readiness."

Component  Reliability.  Chapter  II  treats  the  problem  of 
part failure. The word "part" is preferred over "component"
since a component in this day and age may actually
consist of a large collection of parts which, from a
service standpoint, are indivisible. The origin of
failures and their modes and mechanisms are examined and
a large number of life-test statistics are analyzed.

This  analysis  provides  the  basis  for  conclusions 
about  present  and  predicted  reliability  levels  of  parts, 
and  how  the  reliability  of  a  part  relates  to  its  price. 

From  these  data,  particularly  in  Sections  II-6  to  II-8, 
it should be possible to extract an estimate of the reliability
of a particular part on a simple, albeit very large,
digital computer.*

The numerous details involved in part reliability
appear in Appendices B, C, and D. There, the relationship
between the behavior of a part and the stress to
which it is subjected is discussed for semiconductor
devices, resistors, and capacitors, respectively.

Appendix  E  presents  a  complete  tabular  compendium  of 
the  life-test  failure  statistics  which  were  compiled 
during  this  study. 


*By "simple," we mean that no circuit tricks have
been  used  to  increase  the  reliability,  e.g.,  redundancy. 
Namely,  a  simple  computer  is  assumed  to  consist  of  a 
large  series-chain  of  fallible  parts. 
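
A small numerical sketch of the series-chain idea, using the illustrative figures quoted in Chapter I (a 50,000-part machine and an hour of repair). The per-part failure rate is a made-up value, and the availability expression assumes exponential failure and service times; neither is taken from the tables of this report.

```python
def series_failure_rate(n_parts, part_rate):
    """Failure rate of a series chain: the first part to fail stops the machine."""
    return n_parts * part_rate

def asymptotic_availability(system_rate, mean_repair_hours):
    """Long-run probability of being up: MTBF / (MTBF + mean repair time)."""
    mtbf = 1.0 / system_rate
    return mtbf / (mtbf + mean_repair_hours)

# Hypothetical inputs: only the part count and repair time echo the text.
N_PARTS = 50_000
PART_RATE = 1e-8        # failures per part-hour (assumed for illustration)
REPAIR_HOURS = 1.0

rate = series_failure_rate(N_PARTS, PART_RATE)
print(f"system failure rate: {rate:.1e} per hour")
print(f"mean time between failures: {1.0 / rate:.0f} hr")
print(f"asymptotic availability: {asymptotic_availability(rate, REPAIR_HOURS):.5f}")
```

Because the system rate scales directly with the number of parts, a tenfold improvement in part failure rate buys the same availability as a tenfold reduction in machine size.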
Circuit Design. Since it takes more than reliable parts
to make a reliable computer, Chapter III considers the problem
of obtaining reliable circuits. Chapter III presents
an example of design philosophy which, if followed carefully,
should go a long way toward guaranteeing a good
design.

This  chapter  also  discusses  logical  design  and  reports 
briefly  on  progress  in  reducing  logical  design  errors  by 
computer-aided  techniques. 

Chapter III introduces, as a tool to increase reliability,
the technique of part redundancy. Good parts and
good design are both necessary but unfortunately not sufficient
to insure a reliable computer, so it is frequently
necessary to resort to additional methods such as redundancy.
The analysis appears in Appendix A, but the
results  are  in  this  chapter,  with  some  indication  of  what 
these  methods  can  produce  as  a  function  of  present  and 
predicted  part  reliability. 

Chapter III discusses means of failure detection.
After all practical steps have been taken to obtain a
reliable system, failures still occur; the problem then
is to discover the failures.

Finally, Chapter III shows that the notion of a "failure" needs
considerable refinement.

Multiple  Computers.  Many  systems  rely  on  the  use  of 
multiple  computers  to  obtain  reliability.  Chapter  IV 
introduces  the  use  of  spare  computers  to  accomplish  this 
goal,  and  evaluates  the  availability  of  such  systems  under 
different  forms  of  service  and  system  configuration. 

Chapter IV also discusses the relatively new concept of
the multi-processor, and compares the performance of the
nonredundant, redundant, duplex, and multi-processor.
Finally, it considers the problem of additional programming
and hardware for the case of the multi-processor.

Systems  Considerations.  Chapter  V  takes  up  some  of  the 
factors that are important for matching the data processing
system to its environment. Heading the list is the
problem of writing error-free programs. Suggestions are
made  as  to  how  this  might  be  accomplished  and  further 
analysis  is  made  of  how  such  programs  can  be  checked  (no 
trivial  problem  in  the  case  of  real-time  control  processes) 
and  corrected  when  errors  are  found. 

Then  such  topics  as  interconnection  and  packaging 
reliability, performance in the face of extreme environments
such as shock and radiation, preventive maintenance,
and  quality  control  are  briefly  discussed. 

Maintenance.  The  central  problem  discussed  in  Chapter  VI 
is  discovering  what  to  repair  when  the  computer  malfunctions. 
Current  techniques  in  fault  diagnosis  and  isolation  will 
be presented and some estimates will be made of how effective
and costly they are likely to be. This is a very
important subject, since availability will be shown to
depend strongly on the mean repair time, which can be
drastically  reduced  if  enough  effort  is  expended  on  the 
problems  of  self-diagnosis. 




Chapter  II 

THE  PROCESS  OF  FAILURE 


1.  INTRODUCTION 


A  part  will  be  defined  here  in  the  usual  sense  as  the 
smallest replaceable element in a system. In other literature,
parts are referred to as piece-parts, component parts,
or components. Typical parts are: resistors, transistors,
and integrated circuits. Note that the smallest field-replaceable
element is a maintenance module, usually consisting
of several parts, as described in Chap. III.

This study considers only electronic parts for two
reasons:

o  Electromechanical  devices  used  in  computing  systems 
are  assemblies  containing  many  non-standard  parts 
for  which  there  is  no  orderly  body  of  performance 
and  failure  data  such  as  exists  for  electronic 
parts. 

o  The  notorious  unreliability  of  such  electromechanical 
devices  (typewriters,  magnetic  tape  transports,  etc.) 
seems  to  preclude  their  use  in  any  function  directly 
essential  to  the  primary  system  mission,  if  extremely 
high  reliability  is  required. 

Furthermore,  of  the  totality  of  electronic  parts 
categories  in  existing  and  proposed  computers,  only  a  rather 
small  subset  will  be  examined  in  detail. 

The  major  reasons  for  inclusion  of  a  parts  category 
for  consideration  in  this  study  are: 

o  High  probability  that  the  parts  will  be  used  in 
computers  of  the  type  under  consideration. 
o High probability that the number of parts per
system and predicted individual part reliabilities
will contribute significantly to system failure
rates.

o  High  probability  that  there  are  or  will  be  no 
better  parts  for  the  required  part  functions, 
within  reasonable  cost  and  time  limitations. 

Selected  categories  are  listed  below,  as  are  many 
notable  exclusions,  with  appropriate  comments. 

Silicon planar transistors--In ground applications,
temperature is the most significant stress factor in
semiconductor failure. Germanium exhibits asymptotic
failure behavior at junction temperatures of about
150°C, whereas the critical temperature for silicon
devices is more like 350°C. The cost of silicon
transistors is still significantly higher than germanium,
but progressively increasing silicon demand
will narrow the gap. Also, it is not reasonable to
compare cost of MIL-quality silicon to entertainment
or commercial quality germanium.

The  planar  process  is  selected  as  it  will  most 
likely  be  the  only  surviving  silicon  fabrication 
process. Failure data for mesa devices can be included
with planar data for extrapolations, and, with
some caution, data on other transistor types may be
included.

Silicon  planar  diodes--Essentially  the  same  reasoning 
as  above. 

Monolithic silicon integrated circuits--For digital
applications, the monolithic circuit is better than
the hybrid thin-film circuit because it entails fewer
nondiffused intraconnections, less handling in general,
and lower cost. Reduction to practice of laboratory
methods for thin-film active device fabrication may
shift the balance in favor of the monolithic circuits,
but this hasn't happened yet.

Resistors--There  are  five  types  of  resistors  in 
general  use.  These  are:  composition  carbon,  carbon 
film  (molded  deposited  carbon),  tin  oxide,  metal  film, 
and  wirewound.  Computers  use  few,  if  any,  wirewound 
resistors,  so  this  category  may  be  ignored. 

Capacitors--Dipped mica and glass capacitors will be
considered for the range 1 - 10^4 picofarads. Tubular
paper or plastic capacitors cover the range 10^3 -
10^[?] picofarads, and tantalum and aluminum electrolytic
capacitors are used for the range 10^6 - 10^11 picofarads.
A relatively new type, the multilayer ceramic
capacitor, covers the range 10^[?] - 10^[?] picofarads; lack
of reliability information and probable economic inefficiency
preclude consideration of this type.

In  the  most  fundamental  philosophical  sense,  a  part  has 
failed  if,  upon  the  future  application  of  some  combination  of 
normally  expected  stresses,  one  or  more  of  the  parameters 
of  the  part  would  vary  in  such  a  way  that  the  functional 
assembly  containing  the  part  would  become  incapable  of 
performing  its  function. 

This  "philosophic  failure"  is  as  academic  as  the 
question  of  the  sound  of  an  explosion  in  the  uninhabited 
desert,  and  a  more  practical  definition  might  be: 

A part has failed when, under some combination of
normally applied stresses, one or more parameters of the
part vary in such a way that the functional group containing
the part becomes incapable of performing its
function.

From  the  above  definition,  it  can  be  seen  that  it  is 
difficult  to  establish  a  definition  of  part  failure  which 
is  independent  of  the  nature  of  the  functional  group 
(e.g.,  "circuit")  containing  the  part.  Circuits  can  be, 
and have been, designed to tolerate quite large variations†
in part parameters. Also, circuits can be, and have been,
designed  to  continue  operating  even  when  some  number  of 
parts  have  suddenly  assumed  limit  values  ("open"  or  "short" 
circuit) . 

Modifying an arbitrary definition of failure can
significantly affect the relative merits of various types
of parts. Consider the following example, based on manufacturers'
published data [1]. Suppose very large samples
of metal film and composition carbon resistors are placed
on high-temperature load life test for several thousand
hours.  If  failure  is  defined  as  "resistance  variation 
more  than  2  per  cent  from  nominal,"  then  nearly  all  of  the 
composition  resistors  will  "fail,"  but  almost  none  of  the 
film  resistors.  If  failure  is  defined  as  "resistance 
variation  more  than  30  per  cent  from  nominal,"  then  the 
ratio  of  film  resistor  "failures"  to  composition  resistor 
"failures"  will  be  infinite,  as  none  of  the  composition 
units  will  "fail,"  while  a  few  of  the  film  units  will  open. 


†For instance, -100 to 1600 per cent from nominal
in transistors, -50 to infinite percentage on
electrolytic capacitors, and h_FE of transistors, and ±30
per cent on resistors.
At the part level, the usual practice is to classify
relatively  gradual  or  continuous  parameter  variations  with 
stress  (time  included)  as  "degradation"  and  reserve  the 
term  "failure"  (usually  with  the  superfluous  adjective 
"catastrophic")  for  the  relatively  rapid  or  discontinuous 
passage  of  a  parameter  to  a  limit  value  (open,  short)  or 
a  value  outside  some  statistically  predicted  bounds  for 
degradation  behavior. 

To  avoid  the  qualitative  judgment  implied  by  the  words 
"relatively  rapid  or  discontinuous,"  they  may  be  deleted 
from  the  definition  of  failure.  This  change  does,  however, 
introduce  "inadequate  prediction  of  degradation  bounds"  as 
a  failure  mechanism. 


Whether in operational use, under test, or on the
shelf, a part is characterized by a set of parameters p_i,
i = 1, 2, 3, ..., m, and subjected to stresses s_j,
j = 1, 2, 3, ..., n, where time is included as a stress.
The parameters are functionally related to the stresses by
writing p_i = f_i(s_j), where the form of the f_i may be
quite complicated for even the simplest parts and in many
instances unknown. In the degradation range, which might
be loosely defined by H_i ≥ p_i ≥ L_i, where the H_i and L_i
may be constants or functions of the s_j, efforts are
usually made to find approximate forms
p_i ≈ f_i(s_j) ≈ Σ_j a_ij s_j, where the a_ij are coefficients
of the first-order terms in the series expansion of the f_i,
and represent such often-used coefficients as "temperature
coefficient of resistance" and "voltage coefficient of
resistance." When the f_i are not so tractable, bounds may
be given, such as variation at 100 per cent rated load,
1000 hr, +2 per cent to -4 per cent maximum. For practical
application, bounds of this type are essentially statistical
limits for any part which may be used.
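
A minimal sketch of the first-order form, applied to a single resistor. It assumes the stresses are expressed as deviations from reference conditions and that the sum of a_ij s_j gives the fractional change in the parameter; the coefficient values, stress levels, and ±2 per cent limits are hypothetical.

```python
def fractional_change(coeffs, stresses):
    """First-order estimate: sum over j of a_ij * s_j for one parameter p_i."""
    return sum(coeffs[name] * stresses[name] for name in coeffs)

def in_degradation_range(value, low, high):
    """True when L_i <= p_i <= H_i."""
    return low <= value <= high

# Hypothetical first-order coefficients for the resistance of one part.
COEFFS = {
    "temp_rise_degC": 100e-6,   # temperature coefficient of resistance, per deg C
    "volts": 5e-6,              # voltage coefficient of resistance, per volt
}
NOMINAL_OHMS = 1000.0
LOW, HIGH = 0.98 * NOMINAL_OHMS, 1.02 * NOMINAL_OHMS   # assumed +/-2 per cent limits

for stresses in ({"temp_rise_degC": 60.0, "volts": 20.0},
                 {"temp_rise_degC": 300.0, "volts": 20.0}):
    ohms = NOMINAL_OHMS * (1.0 + fractional_change(COEFFS, stresses))
    print(stresses, f"{ohms:.1f} ohms",
          "within limits" if in_degradation_range(ohms, LOW, HIGH) else "outside limits")
```

The second stress set is deliberately extreme; a linear expansion would not normally be trusted that far from the reference point, which is one reason bounds are quoted instead when the f_i are intractable.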

The point to be made is that good information relative
to degradation of all parameters is available for
most types of parts which are under consideration here.
If the predicted degradation behavior of a certain type
of part is inadequate for the proposed application after
all circuit efficiency tradeoffs have been considered
(Chap. II), a better part must be selected.

More  detailed  examination  of  the  meaning  and  origin 
of  "failure"  is  necessary  because  the  distinction  between 
degradation  and  (catastrophic)  failure  is  quite  arbitrary. 
Setting H_i = +∞, L_i = -∞ categorically reduces the
failure rate to zero for all components, but leaves the
circuit  designer  very  little  to  work  with.  Consider,  for 
instance,  the  following  cases  of  failure. 

Failure occurs if a combination of s_j exists such
that one or more p_i > H_i or one or more p_i < L_i. This
situation in turn has several interpretations. If the
combination of s_j was expected (within absolute bounds, or
statistically predicted) then either the limits H_i, L_i
were assigned with the knowledge that the failure possibility
existed, or in ignorance. Assignment of ±2 per
cent limits to a film resistor with the knowledge that an
"open" can occur is nevertheless reasonable, as it would
not be practical to assign -2 per cent, +∞. If the combination
of s_j was unexpected, then either the absolute
bounds were wrong or a statistically unlikely combination
occurred.
Failure also may occur if some previously unknown or
unexpected stress appears (some s_k, k > n) or some new
functional relationship appears between the expected s_j
and the p_i.

Finally, there may be some truly random failure
mechanisms, even some which do not obey known causal
relationships.

Currently  the  problem  of  reliability  of  electronic 
equipment  is  approached  either  through  statistical  analysis 
and  synthesis  or  physical  reasoning. 

The  statistical  approach  is  essentially  a  carry-over 
into  electronics  from  other  fields,  notably  mechanical 
engineering.  Basic  to  this  method  is  the  assumption  that 
the  gathering  of  sufficient  field  and  test  data  on  samples 
of  parts  manufactured  by  some  relatively  constant  process 
permits  fitting  some  mathematical  functions  to  failure 
behavior.  The  classic  example  yields  the  three-phase 
failure  rate  curve  (sometimes  called  the  "bathtub," 
probably  by  the  same  individuals  who  refer  to  the  normal 
density  function  as  the  "bell")  which  is  a  composite  of 
a  decreasing  failure  rate  for  time  near  zero  (early  failure 
or "infant mortality"), a constant failure rate for some
intermediate  period  and  finally,  an  increasing  failure 
rate  at  some  later  time  (the  process  of  "wearout"). 
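
One common way to sketch the three-phase curve is to add three hazard terms: a Weibull term with shape below one (early failure), a constant term, and a Weibull term with shape above one (wearout). The additive form and every parameter value below are assumptions chosen only to produce the familiar shape; they are not fitted to any data in this report.

```python
def weibull_hazard(t, shape, scale):
    """Weibull failure rate at time t > 0: (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

def bathtub_rate(t, early=(0.5, 2.0e5), constant=1.0e-6, wearout=(4.0, 5.0e4)):
    """Composite failure rate per hour; t must be positive.

    early, wearout: (shape, scale-in-hours) pairs, illustrative values only.
    """
    return weibull_hazard(t, *early) + constant + weibull_hazard(t, *wearout)

for hours in (100.0, 1.0e3, 1.0e4, 5.0e4, 1.0e5):
    print(f"{hours:>9.0f} hr  r(t) = {bathtub_rate(hours):.2e} per hour")
```

Printed in sequence, the rate falls through the early period, bottoms out, and then climbs as the wearout term takes over.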

Wearout  failures  result  from  physical  or  chemical 
changes  with  time,  temperature,  and  other  stresses,  which 
either  cause  a  parameter  to  exceed  a  degradation  bound 
(as in oxidation of a metal film resistor) or to go suddenly
to a limit value (as in the work-hardening and
eventual  breakage  of  a  relay  armature  spring). 
The origin of failures whose failure rate is constant
is  not  so  clear.  In  mechanical  systems,  random  failures 
have  been  said  to  originate  from  simultaneous  combinations 
of  stress,  randomly  and  independently  occurring,  which 
exceed  the  strength  of  the  part.  This  implies  that  either 
all stress limits were not known, or the part was intentionally
not designed to stand all combinations of stress.
In  electronic  work,  it  is  usually  possible  to  design  parts 
which  simultaneously  tolerate  all  anticipated  maximum 
stresses.  If  this  is  done  on  an  absolute  basis  (so-called 
"worst-case"  design),  then  only  ignorance  of  the  limits 
would  permit  random  failure  from  stress.  If  design  is  done 
on  a  statistical  basis,  a  random  failure  mechanism  exists. 

There  are  true  random  mechanisms  affecting  very  small 
electronic  parts,  notably  semiconductor  devices,  at  the 
atomic and subatomic level. Various types of crystal defects
may be caused by internal statistical behavior as a
function of time and temperature, and the ambient radiation
environment. Under normal circumstances (e.g., 55°C
ambient temperature and no recent nearby nuclear explosion)
mathematical estimates of these effects put them several
orders of magnitude below various macroscopic effects as
failure modes [2].

Early  in  the  history  of  reliability  analysis,  when 
the  state  of  the  part  manufacturing  art  was  not  so  well 
advanced, straightforward life-testing of samples of parts
produced sufficient failures to permit reasonably good
curve-fitting and other estimates of failure behavior.

In  six  months  or  so,  a  sample  of  several  hundred  incan¬ 
descent  lamps  could  be  run  until  nearly  all  had  failed. 
By contrast, a sample of a thousand semiconductor devices
might  be  run  for  a  year  with  two,  one,  or  even  zero  failures. 

Two  approaches  to  the  problem  of  insufficient  data 
have  been  used--larger  sample  sizes  and  accelerated  life 
testing.  Ever-increasing  sample  sizes  produce  more  failure 
data,  but  at  proportionate  cost  increase.  The  hypothesis 
underlying  accelerated  life  testing  is  as  follows:  if  the 
incidence  of  failure  and  the  level  of  a  particular  stress 
are  functionally  related  in  some  well-behaved  manner,  the 
failure  rates  observed  at  two  or  more  high  stress  levels 
may  be  used  to  predict  failure  rates  at  lower  stress  levels 
by extrapolation. Many researchers have shown excellent
correlation between observed failure data and simple
functions of absolute temperature for resistor and
semiconductor devices [3]. Others, however, claim that the
observed relations hold only at the higher temperatures, and
that the extrapolation back to lower temperatures is
meaningless, as the high-temperature failure modes (e.g.,
oxidation and phase changes) are virtually non-existent
at the lower temperatures, while other modes exist which
cannot be accelerated by increased ambient temperatures.+

With respect to the procedure of increasing sample
sizes to obtain useful numbers of failures in reasonable
times, the occurrence of certain new failure modes makes
it doubtful whether extrapolation to failure
characteristics of single units is valid. In other words, there
is reason to believe that the failures do not constitute
a Markov process and that, as we test for a longer time,
new failure modes occur. It would certainly be
questionable, therefore, to draw conclusions about the behavior
of a single device over 100,000 hr, based on observations
of 100,000 devices for one hour.

+A major argument against step-stress (accelerated)
testing has been the discrepancy between results obtained
with power-stressing and high temperature (unpowered)
stressing [4,5]. Recent information [6] indicates that
incorrect determination of junction temperature in
power-stress tests may account for the observed differences.
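
The caution against trading devices for hours can be illustrated numerically. The sketch below is not drawn from any of the test programs cited; it assumes, purely for illustration, a Weibull-type cumulative hazard H(t) = lam * t**alpha, and the names cumulative_hazard, pooled_hazard, lam, and alpha are hypothetical. It compares the hazard accumulated by 100,000 devices in one hour with that accumulated by one device in 100,000 hr. Only for alpha = 1, i.e., a constant failure rate, do the two agree.

```python
# Illustrative only: the Weibull-type form and all parameter values are assumptions.

def cumulative_hazard(t_hours, lam, alpha):
    """Hazard accumulated by one device over t_hours: H(t) = lam * t**alpha."""
    return lam * t_hours ** alpha


def pooled_hazard(n_devices, t_hours, lam, alpha):
    """Total hazard accumulated by n_devices, each run for t_hours."""
    return n_devices * cumulative_hazard(t_hours, lam, alpha)


if __name__ == "__main__":
    lam = 1.0e-8  # arbitrary scale factor
    for alpha in (0.6, 1.0, 2.0):
        many_short = pooled_hazard(100_000, 1.0, lam, alpha)
        one_long = pooled_hazard(1, 100_000.0, lam, alpha)
        print(f"alpha={alpha}: 100,000 devices x 1 hr -> {many_short:.3e}; "
              f"1 device x 100,000 hr -> {one_long:.3e}")
```

With a decreasing rate (alpha < 1) the short test on many devices overstates the hazard seen by a single long-lived device; with an increasing rate (alpha > 1) it understates it.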

Another  consequence  of  the  "brute-force"  testing 
method  is  that  the  most  and  best  information  is  available 
on  the  oldest,  i.e.,  most  obsolescent,  parts. 

All of the above considerations have led to the
evolution of the physics-of-failure approach to reliability
prediction. This approach lists and classifies all
significant part failure mechanisms, and establishes, on a
theoretical basis, the functional relation to stress.
Determination of possibly significant mechanisms originates
from observed failure modes or from pure physical
reasoning. Where possible, functional relations are correlated
to observed data to confirm hypotheses. If individually
structure-related and material-related failure mechanisms
are isolated, and functional stress relations established,
the same relations may be carried over to different parts
produced by similar processes, without repeating extensive
tests.

Cautious application of physics-of-failure techniques
can actually identify mechanisms which are amenable to
accelerated testing, which in turn permits verification of
hypotheses, at least for the well-behaved cases [7].

2. FAILURE MODES AND MECHANISMS


Five  terms  relating  to  failure  behavior  need  definition. 

Indication--The  external  observation  that  a  part 
parameter  has  changed  to  a  value  outside  the  estab¬ 
lished  degradation  bounds,  often  to  a  limit  value 
(e.g.,  open,  short,  no  output). 

Mode--The internal occurrence which causes the
indication.

Mechanism--The physico-chemical process underlying
the mode.

Stress--Any  characteristic  of  the  environment  which 
causes  or  allows  the  mechanism  to  proceed. 

Origin--That  aspect  of  the  materials  and  processes 
of  fabrication  of  the  part  which  allows  a  combina¬ 
tion  of  stress  and  mechanism  to  result  in  a  mode. 

Application  of  the  terms  is  exemplified  in  the  two 
cases of integrated circuit failure shown in Table II-1.

These  two  examples,  taken  by  themselves,  show  that  the 
origins  of  failure  fall  into  two  classes. 

o  Fundamental  design  inadequacies  resulting  from 
characteristics  and  limitations  of  materials, 
processes,  procedures,  equipment,  and  personnel. 

o  The  degree  to  which  potential  capabilities  of 
the  design  are  realized  in  actual  fabrication 
(production  engineering,  process  control,  quality 
control) . 

Table II-1

EXAMPLES OF INTEGRATED CIRCUIT FAILURE

              Case 1                        Case 2
              --------------------------    --------------------------
Indication    Open input circuit            High reverse current

Mode          Thermocompression bond        Excessive moisture in can
              off at aluminum pad

Mechanism     Formation of intermetallic    Ionic conduction on surface
              compound ("purple plague")    of semiconductor

Stress        Temperature, time             Ambient air pressure,
                                            humidity, time

Origin        Use of bimetallic             Faulty weld, permitting
              gold-aluminum system in       air leakage into can
              presence of silicon

A third factor which affects failure in a "post facto"
sense is removal of actual or potential defectives,
regardless of origin, before use in equipment or before the
equipment is declared operational. The processes involved
include:

Inspection and non-destructive test--Visual, electrical,
radiographic, infra-red, mechanical, package leakage,
and other tests, conducted either on a sampling or 100
per cent basis.

Destructive test or test-to-failure--Usually step-stress
tests using temperature, power, vibration, or
acceleration. Of necessity these tests are conducted on a lot
sampling basis only.

Screening and burn-in--Subjection of parts to one or
more stresses carefully selected so as to greatly
accelerate decreasing failure rate mechanisms without
significantly affecting any constant or increasing
rate mechanisms. Examples are high temperature bake,
temperature cycling, acceleration, and operation in
test circuits. Burn-in is usually conducted on a
100 per cent basis.+

Debugging--Replacing early failures as they occur
during equipment checkout--a necessary procedure which
operates on a 100 per cent basis and is essentially a
form of burn-in. Debugging is inferior to laboratory
burn-in, however, as the parts in actual equipment,
hopefully, are derated (understressed) while
well-constructed burn-in tests often apply carefully
selected overstresses. Physically small (e.g., airborne)
computers may be burned-in by operating in a
high-stress, usually high-temperature, environment,
but this is still an undesirable procedure, as the
least-derated parts would have to be grossly overstressed
if the most-derated parts are to be significantly
stressed. Furthermore, practical aspects of
checking out a large-scale, ground-based computer at,
say, 125°C for a week, leave something to be desired.

+"Burn-in" is the process of actually operating the part
in a controlled environment for a length of time that,
hopefully, will weed out parts that were destined for early
failure. Effectiveness of the burn-in procedure has been
demonstrated by the Apollo project, in which burn-in procedures
requiring approximately 15 days have produced 0.2-0.3 per
cent early failures, whereas 17,000 parts surviving burn-in
have operated for an average of 120 days per part (49 million
unit hours) with no subsequent failures.
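
The unit-hour figure in the footnote can be checked by simple arithmetic: 17,000 parts at roughly 120 days each amount to about 49 million unit hours. The sketch below also shows what such a failure-free record would imply IF a constant failure rate were assumed; the zero-failure bound is a standard result for the exponential model and is an addition for illustration, not a claim made by the Apollo data source. Function names are hypothetical.

```python
import math


def unit_hours(parts, days_per_part):
    """Total operating hours accumulated by a group of parts."""
    return parts * days_per_part * 24.0


def zero_failure_upper_bound(total_unit_hours, confidence):
    """Upper bound on a constant failure rate (per hr) given zero failures.

    Assumes exponentially distributed lifetimes: the probability of seeing
    no failures in T unit hours is exp(-rate * T), so rate <= -ln(1 - C) / T.
    """
    return -math.log(1.0 - confidence) / total_unit_hours


if __name__ == "__main__":
    hours = unit_hours(17_000, 120)
    bound = zero_failure_upper_bound(hours, 0.90)
    print(f"unit hours: {hours:,.0f}")
    print(f"90% upper bound: {bound * 1.0e5:.4f} %/1000 hr")
```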

To date, most large-scale reliability testing and
theorizing  has  been  conducted  in  support  of  airborne, 
spaceborne,  missile,  and  shipboard  systems.  These  are 
usually  small  computers  designed  for  high  probability  of 
survival  over  mission  times,  or  times  between  maintenance, 
of  minutes  to  a  few  hundred  days.  The  concept  of  a  "useful 
life"  or  "design  life,"  of  several  years,  in  the  few  cases 
where  applicable,  is  nullified  by  the  probability  of  tech¬ 
nological  obsolescence.  Most  manufacturers  of  large 
commercial  computers  do  not  support  massive  electronic 
parts  reliability  programs,  for  several  reasons. 

o  Reasonable  extrapolation  can  be  made  from  military 
data. 

o  Parts  manufacturers  supply  "computer  grade"  parts 
made  by  processes  similar  or  identical  to  those 
used  in  military  parts,  but  without  the  qualifying 
paper  burden. 

o  In-house  parts  reliability  programs  cost  too  much. 

o  Electro-mechanical  peripheral  equipment  failure 
overshadows  electronic  parts  failure  in  most  com¬ 
mercial  systems. 

o  Design  life  is  most  likely  limited  by  techno¬ 
logical  obsolescence. 

Exceptions  to  the  above  might  be  noted  for  real-time  in¬ 
dustrial  control  systems  of  hazardous  processes  (steel 
mills,  refineries),  and  communications  processing  where 
errors  and  delays  incur  significant  costs  (stock  market 
quotations  systems,  on-line  teletype  data  processors, 
subscription  television  billing  computers).  Other  large- 
scale  ground-based  systems  requiring  long  design  life  are 
the  military  warning,  strategic,  and  tactical  machines. 




The  need  for  long  life  in  military  computers  comes  about 
primarily  because  the  military  purchases  the  computer  and 
the  manufacturer  makes  no  provision  to  offer  the  military 
a  "new  model"  every  three  or  four  years. 

The  outstanding  civilian  exception  in  the  realm  of 
non-military  reliability  is,  of  course,  the  Bell  Telephone 
System.  Typical  of  the  fundamental  telephone  company 
attitude  toward  reliability  is  the  design  life  target  for 
solderless,  wire-wrapped  connections  used  in  central 
exchanges: 40 years. This significant difference in
required lifetime, plus the dynamic state of parts technology,
calls for critical review of failure theory and data [8].

At  the  parts  level,  the  most  significant  question  is-- 
are  there  any  increasing  failure  rate  mechanisms  in  any 
of the part-types that may be selected; and if such
mechanisms exist, do they exhibit asymptotic or sharply peaked
behavior  at  times  within  the  design  life  of  the  computer? 
Consider  the  (unlikely)  hypothesis  that  operation  of  the 
system  depends  on  the  operation  of  a  large  number  of  in¬ 
candescent  indicator  lamps.  The  immediate  reaction  of  a 
design  engineer  might  be  to  suggest  use  of  the  available 
(at  a  price)  10,000-hr  lamps,  or  the  (at  a  higher  price) 
50,000-hr  lamps.  But,  from  the  economic  standpoint  of 
a  lamp  manufacturer,  lamp  life  should  be  normally  dis¬ 
tributed  about  the  advertised  value,  with  a  very  small 
variance.  This,  coupled  with  the  fact  that  a  reasonable 
10-year  design  life  contains  87,600  hr,  indicates  that  the 
intuitive  response  to  design  requirements  of  the  subject 
system  may  be  grossly  in  error. 
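
A rough check of the lamp example, under the premise stated above that lamp life is normally distributed about the advertised value. The 5 per cent standard deviation is an assumed figure, chosen only to represent "a very small variance"; the function names are likewise illustrative.

```python
import math

DESIGN_LIFE_HR = 10 * 8760  # 87,600 hr in a 10-year design life


def fraction_failed(t_hours, rated_life_hr, sd_fraction=0.05):
    """Normal CDF: fraction of lamps burned out by t_hours.

    sd_fraction (standard deviation as a fraction of rated life) is an
    assumption, not a manufacturer's figure.
    """
    sd = sd_fraction * rated_life_hr
    z = (t_hours - rated_life_hr) / (sd * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def lifetimes_per_socket(rated_life_hr, design_life_hr=DESIGN_LIFE_HR):
    """Approximate number of lamp lifetimes consumed by one socket."""
    return design_life_hr / rated_life_hr


if __name__ == "__main__":
    for rated in (10_000, 50_000):
        print(rated,
              fraction_failed(DESIGN_LIFE_HR, rated),
              lifetimes_per_socket(rated))
```

Under these assumptions essentially every lamp of either grade has burned out well before the ten-year mark, and nearly all of them do so within a narrow band around the rated life.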

If empirical evidence or prediction based on theory
shows  existence  of  increasing  failure  rate  mechanisms, 
there  are  still  two  possible  mitigating  situations  which 
allow  us  to  live  compatibly  with  such  mechanisms:  the 
moments  of  the  failure  distributions  are  such  that  reason¬ 
able  systems  readiness  may  still  be  achieved,  or  the 
mechanisms  are  distributed  non-uniformly  among  the  popu¬ 
lation  of  parts.  The  first  situation  is  straightforward; 
if  a  "wearout"  mechanism  makes  itself  known  at  50  years, 
as might be the case for solid state diffusion (see
Appendix B), it will probably not significantly affect
system  readiness  over  the  first  ten  years.  As  an  example 
of  these  statements,  Table  II-2  shows  the  mean  failure 
rate  over  the  first  ten  years  of  operation  for  parts  whose 
failure  distribution  is  normal  with  given  mean  and  standard 
deviation.+

+Actually, a truncated normal distribution, but the
left tail, for t<0, is negligible in all but a few cases.

Table II-2

NORMALLY DISTRIBUTED WEAROUT

Mean Wearout     Standard       Mean Failure Rate
Life             Deviation      over 10 years
(years)          (years)        (%/10^3 hr)
------------     ---------      -----------------
     20               5           .026
     20              10           .181
     30               5           .00037
     30              10           .026
     40               5           <10^-10
     40              10           .0015
     50               5           <10^-10
     50              10           .000037

Evidence exists that certain increasing failure rate
mechanisms have origins which are not uniformly distributed
among the part populations. Certainly failures like Case 2
in Table II-1, resulting from obvious manufacturing
defects, are in this class.

The effect of certain contaminants (metallic particles,
water vapor, etchant residues) in transistor, diode, and
integrated circuit packages is worthy of consideration in
this respect. It may be that contaminants are included
in only a small percentage of the population. If so,
and if adequate quality control maintains or improves the
status quo, then the contribution of wearout by contaminant
effects to overall failure rate applies in the same small
percentage. Contaminants may, however, unavoidably be
distributed throughout the entire population.

A  uniform  distribution  of  contaminants  may  cause  a 
severe  wearout  hazard.  If  the  distribution  is  nonuniform 
(normal,  say),  the  hazard  may  be  less  severe,  particularly 
so  if  the  cause-effect  relation  between  the  contaminant 
concentration  and  the  failure  mechanism  is  nonlinear  or 
discontinuous  (exhibits  a  threshold). 
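
The larger entries of Table II-2 can be reproduced with a short calculation. The sketch below assumes that "mean failure rate over 10 years" means the fraction of parts worn out by ten years divided by the elapsed time, and it ignores the truncation of the normal distribution at t = 0; both are interpretations rather than statements from the table. Under those assumptions it returns approximately .026, .181, and .0015 for the (20, 5), (20, 10), and (40, 10) rows.

```python
import math

HOURS_PER_YEAR = 8760.0


def normal_cdf(x, mean, sd):
    """Cumulative normal distribution evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def mean_failure_rate(mean_years, sd_years, horizon_years=10.0):
    """Mean failure rate over the horizon, in per cent per 1000 hr.

    Assumed definition: (fraction failed by the horizon) / (horizon length).
    """
    fraction_failed = normal_cdf(horizon_years, mean_years, sd_years)
    horizon_khr = horizon_years * HOURS_PER_YEAR / 1000.0
    return 100.0 * fraction_failed / horizon_khr


if __name__ == "__main__":
    for mean, sd in ((20, 5), (20, 10), (30, 10), (40, 10)):
        print(mean, sd, f"{mean_failure_rate(mean, sd):.4f}")
```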

The preceding discussion was hypothetical. We will
now  attempt  to  pin  down  actual  failure  behavior,  although, 
unfortunately,  the  available  evidence  is  meager,  and  the 
many  conflicting  interpretations  only  confuse  the  situation. 

3.  MODELS  OF  FAILURE  BEHAVIOR 

Consider  first  the  ideal  device  manufactured  exactly 
in  accordance  with  its  design.  For  a  single  "ideal"  part 
(metal  film  resistor,  say),  if  there  is  one,  or  at  most  a 
few,  dominant  failure  mechanisms,  life  testing  plus  the 
physics  of  failure  (writing  of  a  theoretical/empirical 
equation  for  each  mechanism  plus  isolated  parameter  de¬ 
termination  of  each  mechanism)  can  yield  useful  information. 

For a complex part (e.g., transistors or integrated
circuits) many mechanisms may act simultaneously and their
degree of contribution to the probability of failure may
vary strongly with instantaneous stress values. Further,
the nonhomogeneous nature of the materials and possibly
less-well-understood behavior makes the relation between
physics-of-failure equations and reality even more tenuous.

Consider  now  the  real,  as-built  device,  in  which  all 
the  statistical  variability  of  the  materials  and  manufactur¬ 
ing  processes  has  been  superimposed  on  the  parameters  of 
the  basic  design. 

It  can  be  categorically  stated  that,  in  early  life, 
for  complex  parts  (and  even  a  film  resistor  or  mica 
capacitor  may  be  complex  in  this  sense),  some  form  of 
decreasing failure rate will be observed which is in no
way correlated to predicted behavior of the ideal design.

The rate of decrease may be correlated to the vendor's
name [9,10], relative newness of the product or process,
or  even  the  material  of  an  assembler's  skirt  and  motion 
of  that  portion  of  the  assembler  inside  the  skirt  relative 
to  her  chair  [12].  This  behavior  is  essentially  the 
failure  of  "genetic  defectives"  under  normal  stress  or 
(in  the  case  of  burn-in)  intentional,  controlled  overstress. 

It  must  be  true  that,  in  later  (be  it  decades  or 
centuries)  life,  all  parts  will  show  increasing  failure 
rate  ("aging"  or  "wearout")  mechanisms.  At  any  non-zero 
temperatures,  solid  state  diffusion  affects  all  devices, 
semiconductors  most  significantly.  With  applied  voltages 
and  currents,  drift  and  migration  occur.  With  any  reactive 
elements  in  the  environment,  oxidation,  electrolysis,  and 
similar  actions  take  place.  And  with  any  cycling,  stresses, 
work-hardening,  fatigue,  crystallization,  and  defect  propa¬ 
gation  occur. 

In the period when the early failure rate has
decreased to zero and the wearout rate is still insignificant,
there  may  be  some  very  low-level,  truly  random  (non-causal) 
failure  mechanisms  operating. 

Over  the  duration  of  all  life  tests  performed  so  far, 
and in the opinion of nearly all authorities (e.g., Refs.
13, 14, 15), the decreasing failure rate behavior
dominates. Certainly this is true for tens of thousands of
hours,  and  probably  through  ten  years. 

The few dissenting commentaries point out existence
of shorter-term wearout mechanisms (intermetallic phase
formation, anomalous compound formation, inversion due to
seal leakage, oxide regrowth at windows), but the effects
of all of these are apparently present in only a small
percentage of the total population. In fact, the very
presence of these mechanisms creates the long "tail" of
the decreasing failure rate curve. To reiterate: the
total part population shows a decreasing failure rate
because, and only because, various controllably small
subgroups show initially increasing failure rates of various
forms until every member of the subgroup has failed, at
which time the failure rate of the entire remaining
population effectively decreases.

+Also, see p. 170 of Ref. 11.

Adequate  shakedown  (checkout)  periods,  or  much  pref¬ 
erably,  carefully  designed  burn-in  procedures  can  locate 
the  operational  time  origin  far  down  on  the  decreasing 
rate  slope  without  intruding  on  the  upslope  of  the  long¬ 
term  true  wearout  mechanisms. 

After burn-in, the constituents of the composite
failure-rate curve are mostly post-modal tails of the
subgroup functions, plus any very-low-percentage,
high-variance, high-mode subgroup functions, plus the
background true random rate, plus the premodal tails of the
true wearout functions.

Figure II-1 shows a synthesis of a failure
distribution function which was constructed under the following
hypotheses:

o  Wearout  mechanisms  affecting  the  entire  part 
population  have  a  mean  between  10  and  100  years. 

o  There  are  six  wearout  mechanisms  operating  on 
subsets  of  the  population,  with  means  of  30  to 
30,000  hr. 

o  The  mechanism  with  the  highest  mean  (about  30,000 
hr)  affects  the  smallest  (about  1.7  per  cent)  per¬ 
centage  of  the  population  and  mechanisms  with  pro¬ 
gressively  lower  means  affect  higher  percentages, 
with  the  30-hr  mechanism  affecting  10  per  cent 
of  the  group. 

Fig. II-1

The composite curve shows a very short increasing failure
rate period, followed by a decreasing failure rate to
about 2x10^4 hr, at which time the rate has decreased to zero.
There is then a "golden age" from 10^4 to 10^5 hr (about ten
years) where the failure rate remains zero, or more properly
is reduced to the true random background rate. At 10^5 hr,
the universal wearout mechanisms begin to take effect.

The approximate failure rate vs. time for Fig. II-1 is
tabulated in Table II-3.

Table II-3

FAILURE RATE VS. TIME FOR SYNTHETIC FAILURE DISTRIBUTION
OF FIG. II-1

Time          Failure Rate (%/10^3 hr)
(hr)          (over random background)
--------      ------------------------
      15        170
      30        250
      70        105
     150         40
     300         17
     700          5.8
   1,500          4.5
   3,000          1.5
   7,000          0.3
  15,000          0.05
  30,000          0.00
  70,000          0.00
 150,000          0.17
 300,000          0.13
 700,000          0.04

The time scale is to be considered as "equivalent
unstressed hours," where legitimate accelerating processes
exist. The effect of burn-in may be noted by "throwing
away" that portion of the cumulative distribution to the
left of the equivalent time represented by the burn-in
process.
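
The way in which several subgroup wearout mechanisms combine into a decreasing population failure rate can be sketched as follows. The subgroup fractions and means below are hypothetical values interpolated geometrically between the end points stated in the hypotheses (10 per cent at 30 hr, 1.7 per cent at 30,000 hr); the normal form, the standard deviation of one-half the mean, and the neglect of both the left tail and the universal long-term wearout mechanisms are further assumptions. The result is therefore only qualitatively comparable with Fig. II-1 and Table II-3.

```python
import math

# (fraction of population, mean life in hr) -- hypothetical values.
SUBGROUPS = [
    (0.100, 30.0),
    (0.070, 119.0),
    (0.049, 475.0),
    (0.0345, 1892.0),
    (0.024, 7536.0),
    (0.017, 30000.0),
]
SD_OVER_MEAN = 0.5  # assumed spread of each mechanism


def normal_pdf(t, mean, sd):
    z = (t - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(t, mean, sd):
    return 0.5 * (1.0 + math.erf((t - mean) / (sd * math.sqrt(2.0))))


def population_failure_rate(t_hours):
    """Failure rate (per hr) of the surviving population at time t_hours."""
    density = sum(p * normal_pdf(t_hours, m, SD_OVER_MEAN * m)
                  for p, m in SUBGROUPS)
    failed = sum(p * normal_cdf(t_hours, m, SD_OVER_MEAN * m)
                 for p, m in SUBGROUPS)
    return density / (1.0 - failed)


def rate_after_burn_in(t_hours, burn_in_hours):
    """Burn-in discards the curve to the left of burn_in_hours."""
    return population_failure_rate(t_hours + burn_in_hours)


if __name__ == "__main__":
    for t in (30, 300, 3000, 30000, 100000):
        # 1 per hr = 1e5 per cent per 1000 hr
        print(t, f"{1.0e5 * population_failure_rate(t):.3g} %/1000 hr")
```

Each subgroup by itself has an increasing failure rate until it is exhausted; it is the exhaustion of successively smaller subgroups that makes the population rate fall.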

Faced  with  evidence  of  multiple  early-wearout  mechanisms 
in  semiconductor  devices,  plus  the  likelihood  that  they  are 
much  less-well-behaved  in  reality  than  in  the  example  of 
Fig. II-1, it is difficult to see how efforts to fit various
popular  distributions  (notably  the  Weibull  and  the  log- 
normal)  can  succeed.  Some  attempts  at  fitting  are  given  in 
[16]  and  [17],  and  the  abandonment  of  an  heroic  attempt  is 
stated  in  [15].  Descriptions  of  the  distributions  used  in 
the  fitting  game  may  be  found  in  [18]. 

The question remains--how can system reliability be
computed without a theoretical or empirical failure
distribution? There seems to be but one rational answer: When
there is no additional information, choose a distribution
with a constant failure rate (this is uniquely the
exponential distribution+). This constant failure rate, λ, may be
obtained by computing a weighted average of the true failure
rate over the design life and defining λ to be this number,
or by defining λ to be piecewise constant over the interval of
interest. These constant pieces should be estimated by the
educated combination of all the available processes--physics
of failure, brute force life tests, accelerated step-stress
tests, and engineering intuition.

+If $T_f$ is a random variable denoting the time to
failure, then $T_f$ is exponentially distributed if
$\Pr\{T_f \ge t\} = e^{-\lambda t}$, $\lambda$ = constant. See Appendix A.
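
A piecewise-constant failure rate of the kind just described might be handled as in the following sketch. The segment boundaries and rates in the example are invented for illustration, and the function names are hypothetical; the only relation relied upon is that the survival probability is the exponential of minus the accumulated hazard.

```python
import math


def cumulative_hazard(t, segments):
    """Integral of a piecewise-constant failure rate from 0 to t.

    segments: list of (end_time_hr, rate_per_hr), end times increasing;
    each rate applies from the previous end time up to its own end time.
    The last segment should extend at least to t.
    """
    total, start = 0.0, 0.0
    for end, rate in segments:
        if t <= start:
            break
        total += rate * (min(t, end) - start)
        start = end
    return total


def reliability(t, segments):
    """Probability that a part survives to time t."""
    return math.exp(-cumulative_hazard(t, segments))


def average_rate(t, segments):
    """Single constant rate giving the same survival probability at t."""
    return cumulative_hazard(t, segments) / t


if __name__ == "__main__":
    # Hypothetical: a higher rate for the first 1000 hr, then a lower one.
    example = [(1000.0, 2.0e-6), (87600.0, 5.0e-7)]
    print(reliability(87600.0, example), average_rate(87600.0, example))
```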

Eliminating unreasonable figures from the mass of
available data purporting to indicate inherent failure
rates requires engineering intuition (plus, possibly,
some detective work). There are two types of offenders--
unreasonably high failure rates and suspiciously low
failure rates. Causes of unusually high failure rates
include:

o  Use  of  field  data  which  includes  secondary  failures, 
"homicides,"  i.e.,  destruction  of  parts  resulting 
from  human  error,  and  replacement  of  parts  which 
were not truly defective. Homicides are particularly
significant. On one large computer (450,000
semiconductor devices) the transistor failure rate
was .0013%/10^3 hr in one quarter and .114%/10^3 hr
in the preceding quarter. Inquiry produced the
following approximate statement: "Oh yes, accidents
happen. Just last month a man dropped a probe and
took out 100 modules." (In this example, then,
human error introduced about two orders of magnitude
increase in reported failure rate.)

o  Use  of  data  on  a  new  product  for  which  the  quality 
control  process  has  not  had  time  to  operate. 

o  Use  of  data  from  tests  conducted  at  stresses  sig¬ 
nificantly  higher  than  those  anticipated  in  the 
application. 

Causes  of  unusually  low  failure  rates  include: 

o  Extrapolation  from  accelerated  or  step-stress  tests 
where  it  is  not  clear  that  this  is  physically 
legitimate,  or  such  extrapolation  using  some  em¬ 
pirical  formula  in  a  range  far  from  that  in  which 
its  parameters  were  determined. 

o  Discarding  certain  failures  from  the  statistics  as 
errors  in  fabrication  or  for  other  reasons,  unless 
it  is  clearly  proven  that  the  origin  of  the  defects 
has  been  removed. 

4. ESTIMATES OF NON-CONSTANT FAILURE RATES

The  best  available  evidence  and  opinion  indicates 
that,  for  the  parts  under  consideration,  there  are  no 
increasing  failure  rate  mechanisms  which  will  have 
significant  effects  in  the  first  few  decades  of  part  life. 

A considerable body of opinion and evidence indicates
decreasing failure rate behavior of many part-types,
notably semiconductor devices and certain capacitors. It
is not clear that the decreasing failure rate behavior
(observations) is indicative of the existence of
corresponding decreasing failure rate mechanisms. As stated
earlier, superposition of various subgroup short-term
mechanisms probably produces the observed group behavior.

Nevertheless,  if  the  group  behavior  is  as  observed, 
and  if  screening,  burn-in,  and  quality  control  cannot 
remove  "genetically  defective"  individuals,  a  decreasing 
failure  rate  may  have  to  be  seriously  considered  in  pre¬ 
dicting  system  reliability,  unless  the  rates  settle  to 
some  constant  or  near-constant  background  value  in  a 
small fraction of the design life (e.g., one year for a
ten-year system).

Reference  15  gives  one  of  the  largest  collections  of 
data  gathered  for  a  single  investigation  of  non-constant 
failure  rate  behavior.  Eight  manufacturers  contributed 
life  test  data  on  10,300  transistors  of  24  types.  Maximum 
test  time  was  1000  hr  for  all  but  two  types.  Most  tests 
were  at  reasonably  high  temperature  or  dissipation  levels, 
and  definitions  of  failure  were  arbitrary  limits  of  two 
ranges--initial and end-of-life.
Failures were logged at various elapsed-time intervals.
The Weibull failure distribution was assumed, and
attempts were made to compute the parameters α and λ of
the Weibull distribution and the associated confidence
limits.+ After high-power and unijunction transistor data,
and those cases for which insufficient failures occurred,
were discarded, what meager information remains is given
in Table II-4.


Table II-4

ESTIMATED PARAMETERS FOR WEIBULL FAILURE DISTRIBUTION

                                       Weibull Parameters(b)
              Junction            Initial Limit      Life Test Limit
              Temperature, °C     Failures           Failures
              (operating or       ---------------    ---------------
Type          storage)            α        1/λ       α        1/λ
----------    ----------------    -----    -----     -----    -----
2N652A        100 sto             1.15     10.0      1.00     60.3
2N652A        100 op(a)           0.66     16.4      -        -
2N705         300 sto             0.15     45.0      0.23     77.0
2N705         100 op(a)           0.53     55.0      -        -
2N718A        200 op(a)           -        -         0.29     61.0
2N744         175 op(a)           0.56     6.62      -        -
2N962, 964    100 op(a)           1.00     50.4      0.55     66.7

(a) Estimated from dissipation and thermal resistance.
(b) With t in thousands of hours.

+If $T_f$ is the time to failure, then $T_f$ is Weibull
distributed if $\Pr\{T_f \ge t\} = e^{-\lambda t^{\alpha}}$. The failure
rate for the Weibull distribution is
$r(t) = \lambda \alpha t^{\alpha - 1}$.
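
For readers who wish to evaluate the Weibull expressions for a row of the table, a minimal sketch follows. It takes the survival function exp(-λ t^α) and the failure rate λ α t^(α-1) with t in thousands of hours; the function and argument names are illustrative, and the example uses the α = 0.66, 1/λ = 16.4 entry.

```python
import math


def weibull_survival(t_khr, alpha, lam):
    """Probability of surviving to t_khr (thousands of hours)."""
    return math.exp(-lam * t_khr ** alpha)


def weibull_failure_rate(t_khr, alpha, lam):
    """Failure rate per thousand hours at t_khr; requires t_khr > 0."""
    return lam * alpha * t_khr ** (alpha - 1.0)


if __name__ == "__main__":
    alpha, lam = 0.66, 1.0 / 16.4
    for t in (0.1, 1.0, 10.0):
        print(t, weibull_survival(t, alpha, lam),
              weibull_failure_rate(t, alpha, lam))
```

For α < 1 the computed rate falls with time, for α = 1 it is the constant λ, and for α > 1 it rises.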

For the data as given, enormously large failure
rates result. Some effort is necessary to extrapolate
observed values to more reasonable operating conditions.
Assuming 55°C ambient and 15°C rise gives an operating
junction temperature of 70°C for the proposed system.
Acknowledging the hazards of the process, an attempt was
made to produce an Arrhenius extrapolation from the data,
assuming

$\log \lambda(T_2) = \log \lambda(T_1) - k\,(1/T_2 - 1/T_1).$

Values of k obtained from Refs. 2, 5, 17, 19, and 20 were
as follows: 1.47, 2.20, 4.44, 4.99, 5.21, 5.70, 6.20, 6.23,
19.1, for λ in %/1000 hr and reciprocal temperature in
1000/T°K.
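
The mechanics of such an extrapolation can be sketched as below. The sketch assumes the relation log λ(T2) = log λ(T1) - k(1000/T2 - 1000/T1) with common (base-10) logarithms and absolute temperature; the base of the logarithm is an assumption, and the function name is hypothetical. It is a numerical illustration only and inherits all the hazards noted in the text.

```python
def arrhenius_rate(rate_at_t1, t1_celsius, t2_celsius, k):
    """Extrapolate a failure rate from temperature T1 to temperature T2.

    Assumes log10(rate) is linear in 1000/T, T in degrees Kelvin,
    with slope -k.
    """
    inv_t1 = 1000.0 / (t1_celsius + 273.15)
    inv_t2 = 1000.0 / (t2_celsius + 273.15)
    return rate_at_t1 * 10.0 ** (-k * (inv_t2 - inv_t1))


if __name__ == "__main__":
    # Scale factor from a 100 C test to a 70 C junction for several k values.
    for k in (1.47, 5.21, 19.1):
        print(k, arrhenius_rate(1.0, 100.0, 70.0, k))
```

The spread of the resulting scale factors over the quoted range of k gives some feeling for how sensitive the extrapolated rate is to the choice of k.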

This extrapolation to 70°C resulted
in the following "best 1965" estimates for the part failure
rate, if a Weibull distribution is assumed.

Transistor: $r(t) = .005\,t^{-0.4}$

Diode: $r(t) = .0025\,t^{-0.4}$

Resistor and Capacitor: $r(t) = .00017\,t^{-0.4}$


The  validity  of  these  failure  rates  is  marginal  at 
best.  Both  the  original  data  and  the  extrapolations  are 
suspect;  but  since  the  subject  of  estimating  decreasing 
failure  rate  keeps  arising,  the  authors  decided  to  in¬ 
clude  these  data. 

5. STORAGE LIFE VERSUS OPERATING LIFE

If  there  is  only  a  periodic  demand  for  a  system,  it 
is  essential  to  consider  the  effect  on  reliability  of 
putting  the  system  in  some  standby  condition  between 
operating  periods.  "Standby"  might  mean  a  completely 
de-energized  state  or  a  carefully  designed  condition  where 
the  parts  are  subjected  to  some  optimal  environment. 

Effects  of  environment,  in  this  context,  on  various  part- 
types  will  first  be  considered. 

Composition resistors [1]--Humidity and voltage
degradation effects are largely reversible. Operation
at no less than 1/10 rated dissipation, or in a
controlled-humidity environment, will minimize
humidity effects. Temperature effects are partially
reversible, while load-life effects are relatively
permanent. The optimum standby condition would be:
low ambient temperature and humidity, and zero power
dissipation.

Film resistors--Degradation mechanisms are enhanced
by power dissipation and temperature. However, a
large number of power-temperature cycles might
increase the probability of "open" failure. Optimum
standby condition: probably left on, if well-derated.

Capacitors (all non-electrolytic)--Life is a sensitive
function of voltage and temperature. Optimum standby
condition: zero volts and low ambient temperature.

Electrolytic capacitors--Voltage, surge current, and
temperature decrease operating life, but some forward
voltage is required to prevent de-forming. Optimum
standby condition: low ambient temperature, forming
voltage applied through current-limiting resistor.

Semiconductor  devices-- Early  and  long-term  mechanisms 
are  enhanced  by  voltage,  current,  power,  temperature, 
and  humidity.  Cycling  also  may  have  significant 
effects,  and  sensitivity  to  over-voltage  transients 
is  extreme.  Optimum  standby  condition:  low  ambient 
temperature  and  humidity,  zero  power  dissipation. 

Although  the  above  discussion  clearly  defines  an 
optimal  standby  condition  for  a  system,  there  are  some 
very  strong  arguments  against  standby  operation.  If  the 
system must be energized daily or more often for self-check
purposes, the potential hazard of stress-cycling and
uncontrolled transient damage must be very carefully
evaluated.

There  is  a  considerable  economic  justification  (and 
pressure)  to  operate  the  system  for  routine  computation 
when it is not performing its primary role. Also, most
systems (e.g., ballistic missile defense) are in a
surveillance mode and must therefore be on, although
operating well below capacity.

If the temperature, humidity, and power derating of
the parts are well controlled, there is little absolute
difference between the system failure rate of the machine
when operating and when on standby--perhaps a factor of
two, at most. If this is weighed against the value of
the  system  capability  for  peripheral  tasks,  and  the  risk 
of  cycling  or  transient  effects,  there  does  not  seem  to 
be  any  clear-cut  advantage  in  the  standby  mode. 




6. COST-RELIABILITY RELATIONSHIPS


Functional relationships between inherent reliability
and cost are usually difficult to obtain for the following
reasons:

o  Demonstrated  reliability,  for  parts  already  on 
test,  steadily  improves  with  the  passage  of  time, 
unless  and  until 

o  A  single  failure  occurs,  which  instantaneously 
and  dramatically  increases  the  proven  failure 
rate,  certainly  with  no  accompanying  change  in 
product  cost. 

It  is  nevertheless  reasonable  to  assume  a  correlation 
among  reliability,  quality,  and  cost.  Furthermore,  attempts 
by manufacturers to qualify parts to various reliability
levels must involve two major aspects of quality
improvement:

o  Use  of  basic  research,  along  with  destructive  test 
and  field  failure  data,  to  modify  and  improve 
materials  and  processes; 

o  Maintenance  of  quality  of  materials  and  uniformity 
of  processes. 

The cost/reliability/complexity graphs shown in Figs.
II-2 and II-3 (which will be described at the end of Sec.
II-6) were prepared for three approaches to system parts
procurement:

o Buy good commercial-industrial (computer-grade)
parts and use as-received;

o Buy as above but perform limited in-house screening
and burn-in;†


†Such in-house tests include: Transistor--measure and
record hFE and ICBO, then bake, temperature cycle and centri-


Fig. II-3 -- Computer cost and reliability
(axis labels: MTBF (hr); MTBF x number of unit complements (thousands of hr))
o Buy high-reliability parts and use as-received.
"High-reliability" here assumes some procedure
such as two stages of 100 per cent screen and
burn-in, sampling tests for mechanical and
environmental stresses, long-term life tests, and
a quality assurance program.

Individual part costs, in reasonably large quantities,
are estimated below in Table II-5, with estimated failure
rates in %/1000 hr in parentheses.

Relations between the failure rates permit expressing
the system complexity in "equivalent transistors" and use
of only the transistor failure rate in reliability
calculations. The complexity is defined as


Complexity  =  T  +  D/2  +  (R+C)/30  , 

where  T  =  number  of  transistors,  D  =  number  of  diodes, 

R  =  number  of  resistors,  C  =  number  of  capacitors. 
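
The complexity formula above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the sample quantities are those of the unit parts complement used in this section (composition and film resistors are lumped together as R).

```python
def equivalent_transistors(transistors, diodes, resistors, capacitors):
    """Complexity = T + D/2 + (R + C)/30, in equivalent transistors."""
    return transistors + diodes / 2 + (resistors + capacitors) / 30


# Unit parts complement: 10,000 transistors, 15,000 diodes,
# 20,000 resistors (15,000 composition + 5,000 film), 5,000 mica capacitors.
unit_complexity = equivalent_transistors(10_000, 15_000, 20_000, 5_000)
print(f"{unit_complexity:.1f} equivalent transistors")
```

The result is about 18,333 equivalent transistors; Table II-8 lists 18,334 because each row is rounded before summing.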

Approximate complexities of some existing systems
are shown in Table II-6 for orientation.

The percentage of each type of part in an average
computer is estimated in Table II-7.

If a unit parts complement is taken as 10,000 transistors,
15,000 diodes, 15,000 composition resistors, 5000
film resistors, and 5000 mica capacitors, the complexity


fuge, measure and record hFE and ICBO again and reject
any failed or deviant units; Diode--same as above, measuring
[...] and Irev; Integrated circuit--as above, measuring
selected transfer relations; Resistor--measure R, temperature
cycle and measure R; Capacitor--measure C and Ileak,
and apply simultaneous voltage and temperature stress,
measure C and Ileak.
Table II-5

COSTS FOR THREE GRADES OF PARTS


                                      Computer Grade
                                      with User Screen
                  Computer Grade      and Burn-in         High-Reliability
Item              ($)                 ($)                 ($)

Transistor        3.00 (.01)          3.35 (.005)         4.50 (.003)
Diode             0.50 (.005)         0.65 (.0025)        1.00 (.0015)
Comp. Resistor    0.04 (.0003)        0.08 (.0002)        -
Film Resistor     0.10 (.0003)        0.15 (.0002)        0.50 (.0001)
Mica Capacitor    0.04 (.0003)        0.08 (.0002)        0.35 (.0001)
Int. Ckt.(a)
  10 parts(b)     10.00 (.02)         12.00 (.009)        15.00 (.005)
  30 parts        20.00 (.04)         24.00 (.02)         30.00 (.009)
  100 parts       40.00 (.07)         48.00 (.03)         60.00 (.015)

(a) Integrated circuit--price estimated for mid-1965; availability
and price of more complex circuits somewhat uncertain.

(b) The parts are both active and passive.
Table II-6

COMPLEXITY OF EXISTING SYSTEMS


                      Complexity in
System                Equivalent Transistors

FSQ-32                383 x 10^3
FSQ-31V               274 x 10^3
CDC-3600               97 x 10^3
CDC-1604A              82 x 10^3
Univac 1107           [illegible]
Burroughs B-5000       67 x 10^3
Honeywell H-1800       49 x 10^3
Honeywell D-825        41 x 10^3
SDS 9300               35 x 10^3
USQ-20                 30 x 10^3
IBM 7090/44            26 x 10^3
GE 215/225/235         22 x 10^3
Table II-7

PART PERCENTAGES OF AN AVERAGE COMPUTER


Part                     Percentage

Transistor               20
Diode                    30
Composition resistor     30
Film resistor            10
Mica capacitor           10

and  cost  for  each  of  the  three  procurement  policies 
listed  on  pp.  38-43  are  given  in  Table  II-8. 

The  failure  rate,  appearing  in  Table  II-9,  is  the 
product  of  the  number  of  equivalent  transistors  by  the 
transistor  failure  rate  in  each  procurement  category. 
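
As a rough check on this product rule, the sketch below multiplies the unit complexity by the transistor failure rate for each procurement policy and converts the result to an MTBF. It assumes a constant (exponential) failure rate, so that MTBF is simply the reciprocal of the rate; the function and variable names are illustrative only.

```python
UNIT_COMPLEXITY = 18_334  # equivalent transistors (total from Table II-8)

# Transistor failure rates in %/1000 hr, from Table II-5.
TRANSISTOR_RATE = {
    "computer grade": 0.01,
    "computer grade with burn-in": 0.005,
    "high reliability": 0.003,
}


def system_failure_rate(complexity, transistor_rate):
    """System failure rate in %/1000 hr."""
    return complexity * transistor_rate


def mtbf_hours(rate_pct_per_1000hr):
    """MTBF in hours, assuming a constant failure rate in %/1000 hr."""
    failures_per_hour = rate_pct_per_1000hr / 100 / 1000
    return 1 / failures_per_hour


for policy, rate in TRANSISTOR_RATE.items():
    lam = system_failure_rate(UNIT_COMPLEXITY, rate)
    print(f"{policy}: {lam:.0f} %/1000 hr, MTBF about {mtbf_hours(lam):.0f} hr")
```

The computed values agree with the entries of Table II-9 to within a fraction of a per cent; the small differences come from rounding the failure rate before taking the reciprocal.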

Evaluation of configurations using integrated circuits
of various complexities requires some cautious
interpretation. The results below are based on the
following assumptions.

o One can integrate 80 per cent of the system. The
remaining 20 per cent, such as line and memory
driver circuits, remains discrete.

o The integrated portion is 2/3 logic-type circuits
and 1/3 flip-flop or register-type circuits.

o Speed-efficiency tradeoffs are such that an integrated
logic circuit requires 20 per cent more
parts than its discrete equivalent, and an integrated
register circuit requires twice as many
parts as its discrete counterpart.
Table II-8

COST VS. PROCUREMENT POLICY FOR A UNIT PARTS COMPLEMENT


                               Transistor        Cost Extensions
Part               Quantity    Equivalent    As Is      Burn-in    Hi-Rel

Transistor         10,000      10,000        $30,000    $33,500    $45,000
Diode              15,000       7,500          7,500      9,750     15,000
Comp. resistor     15,000         500            600      1,200      7,500(a)
Film resistor       5,000         167            500        750      2,500
Mica capacitor      5,000         167            200        400      1,050

Total equivalent
transistors                    18,334

Total parts cost                             $38,800    $45,600    $71,050

(a) Assumes all film resistors used.


Table II-9

FAILURE RATE VS. PROCUREMENT POLICY FOR A
UNIT PARTS COMPLEMENT USING DISCRETE PARTS


                                                 Mean Time
Procurement                  Failure Rate        Between Failures
Policy                       (%/1000 hr)         (MTBF) (hr)

Computer grade               183                  547
Computer grade
with burn-in                  92                 1090
High reliability              55                 1820
The breakdown of a unit system using integrated
circuits is given in Table II-10.


Table II-10

DISTRIBUTION OF PARTS IN A UNIT SYSTEM


                                  Transistor     Uncorrected
Part                 Discrete     Equivalent     Integrated

Transistor           2,000        2,000           8,000
Diode                3,000        1,500          12,000
Comp. resistor       3,000          100          12,000
Film resistor        1,000           33           4,000
Mica capacitor       1,000           33           4,000

Total discrete
transistor-
equivalents                       3,666

Total uncorrected
integrated equivalent
parts                                            40,000

Using the assumptions given on p. 43, the number of
integrated equivalent parts, corrected for speed and
efficiency, is


2/3 x 1.2 x 40,000 + 1/3 x 2 x 40,000 = 58,666 .


Thus  the  required  number  of  integrated  circuits  of  com¬ 
plexity  10,  30,  and  100  is 
N_10 = 5867 ,     N_30 = 1955 ,     N_100 = 587 ,

and the failure rates for a unit parts complement computer,
under the three procurement policies and using integrated
circuits, are given in Table II-11.
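
The correction for speed and efficiency, and the resulting circuit counts, can be reproduced with the short sketch below. The function names and the default arguments are illustrative; the factors 1.2 and 2.0 and the 2/3 : 1/3 split are the assumptions listed earlier in this section.

```python
def corrected_integrated_parts(uncorrected, logic_factor=1.2,
                               register_factor=2.0, logic_share=2 / 3):
    """Integrated equivalent parts after the speed-efficiency correction."""
    register_share = 1 - logic_share
    return uncorrected * (logic_share * logic_factor
                          + register_share * register_factor)


def circuits_required(parts, parts_per_circuit):
    """Number of integrated circuits needed (unrounded quotient)."""
    return parts / parts_per_circuit


parts = corrected_integrated_parts(40_000)
print(f"corrected parts: {parts:.0f}")
for n in (10, 30, 100):
    print(f"complexity {n}: {circuits_required(parts, n):.1f} circuits")
```

Passing `logic_factor=1.5` and `register_factor=3.0` raises the corrected count from about 58,667 to 80,000 equivalent parts, which illustrates how sensitive the comparison is to the assumed efficiency ratios.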

Table II-12 summarizes the overall results. Before
any conclusions are drawn, additional interpretation is
required. Although the high-complexity, premium-priced
(high-reliability), integrated-circuit approach seems to
give the most system reliability per parts dollar, the
quantities required for any really large program (e.g.,
data processing for a ballistic missile defense system)
could easily challenge the capability of the industry to
meet the need. For instance, if 100 systems, each of
ten-unit-complement size, were built, 587,000
high-complexity integrated circuits would be required. It is
doubtful that the stated cost-reliability relation could
be maintained (pre-1968) with production requirements of
this size.

Moreover,  the  above  figures  are  sensitive  to  estimates 
of  the  efficiency-ratios  of  integrated  vs.  discrete  circuits. 
If 1.5 and 3.0 are used instead of 1.2 and 2.0, the
integrated approach appears less advantageous.

Also,  consideration  of  the  maintenance  process  and 
maintenance  module  size  (Chap.  VI)  may  show  the  integrated 
approach  to  be  inefficient  in  maintained  ground  systems. 
Further, entries in Table II-12 are parts costs only.
The various costs and burdens imposed in the overall
process of providing completed, installed, maintained systems
may greatly decrease the significance of parts costs. The
pro rata cost of programming alone may approach the parts
costs. Furthermore, the pro rata costs of design for
efficient utilization of high-complexity integrated circuits
may be significant.

Finally, it may later be demonstrated that even the
best MTBF given in Table II-12 is inadequate with realistic
maintenance times, whereas the necessary step to some form
of redundancy may make the simplest approach more than
adequate.

The  graphs  in  Figs.  II-2  and  II-3  show  relationships 
among  parts  costs,  system  size,  MTBF,  and  mechanization  for 
non-redundant  systems,  with  attempted  indication  of  effects 
of  some  of  the  above  factors. 

7.  PRESENT  AND  PREDICTED  RELIABILITY  LEVELS 

Based on the available information, and assuming
exponential failure distribution, Table II-13 gives
estimates of present and predicted reliability levels for the
part types which have been evaluated. In spite of the fact
that each manufacturer's sample that went into the
compilation of Table II-13 claimed a confidence level of 90 per
cent,† the spread of samples was so large that the accuracy
of the entries is probably +100 per cent, -50 per cent.

Values in the table have 90-per-cent confidence levels,
for 55°C maximum lead or case temperature, 70°C maximum
hot-spot temperature, and 50-per-cent maximum relative
humidity.

†The 90-per-cent confidence level samples were for 55°C
maximum lead or case temperature, 70°C maximum hot-spot
temperature, and 50-per-cent relative humidity.

Table II-13

PREDICTED PART FAILURE RATES
(%/1000 hr, 10-year average)


Part                            Good 1965    Best 1965    Best 1968

Resistor (composition, metal
film, tin oxide)                .0003        .0001        .00003
Capacitor (glass, mica)         .0003        .0001        .00003
Diode (silicon planar)          .005         .0015        .0005
Transistor (silicon planar)     .01          .003         .001
Integrated circuit
(silicon planar)
  10 equivalent parts           .02          .005         .0015
  30 equivalent parts           .04          .009         .0025
  100 equivalent parts          .07          .015         .0040

Some discussion is in order on the integrated circuit
entries, for two reasons. The first is the widely publicized
industry attitude that an integrated circuit can be just as
reliable as a single transistor, because the manufacturing
processes are identical. This just is not true; composite
information from Borofsky [21] and failure mode/mechanism
data (Table B-2) gives the approximate percentage
contributions of planar process defects to part failure shown
in Table II-14.
Table II-14

PER CENT CONTRIBUTION OF PLANAR PROCESS DEFECTS
TO PART FAILURE


Item   Defect                                      Percentage

1.     Package                                     17.0
2.     Gross (scratch, crack, foreign
       material, corrosive residue, etc.)          12.2
3.     Die bond to package                          6.1
4.     Surface, contamination, passivation,
       diffusion                                   27.6
5.     Bonds to die                                22.1
6.     Leads to terminals                           7.2
7.     Deposited aluminum interconnections,
       windows, registration                        6.4
8.     Silicon material                             1.4

The first three items admittedly affect transistors
and integrated circuitry equally. Items 5 and 6 are
directly related to the number of external connections,
which might be expected to increase somewhat with circuit
complexity. A transistor has two internal leads (collector
connected to case), a 10-equivalent-part circuit
(e.g., 3-input NOR) has 6 leads, and 30-part and 100-part
circuits are estimated to require 10 and 14 leads,
respectively.

Items 4 and 8 are proportional to surface area. Although
a 10-part circuit can be put on a chip not much
larger than that required for a transistor (due to handling
limitations), a 30-part circuit would require somewhat more
area, and a 100-part circuit considerably more at the
present state of the mask fabrication and registration
art.

Even with maximum topological cleverness, Item 7 is
quite significant. It is estimated that the ratio of
windows and interconnections to parts is 40 per cent for
the 10-part circuit, 30 per cent for the 30-part circuit
and 20 per cent for the circuit of 100 equivalent parts.

With the above estimates, it is possible to prepare
a weighted relative failure rate figure for each complexity
level. This is shown in Table II-15.

From  the  above  very  approximate  analysis,  it  would 
appear  that  the  failure  rates  for  integrated  circuits  of 
10,  30,  and  100  equivalent  parts  might  respectively  be  2, 
3.5,  and  6  times  that  of  a  single  transistor.  Figures  in 
the  failure  rate  chart  assume  some  improvement  in  these 
ratios  in  the  future,  and  are  scaled  to  the  .005  entry 
for  "best  1965,  10  equivalent  parts,"  which  essentially 
represents the most significant actual failure rate datum
obtained [9].

With respect to external circuit failures induced by
degradation of internal elements, integrated circuits
appear to have a considerable statistical advantage over
discrete circuits. This advantage might be partially or
completely lost, due to the compromises required in
monolithic device design.

The second item for discussion related to integrated
circuits is the rather poor overall showing of integrated
circuits in the Compendium of Failure Statistics (Appendix
E). At first, the figures seem to introduce serious doubts
about recommending their use in large-scale, ground-based
computers where size and weight are not at a premium. As
it is apparently possible to achieve .00002%/1000 hr
failure rates with well-made solder joints or welds [9],
it is not reasonable to argue that interconnections will
cause discrete circuits to be unreliable.

Moreover, it appears that the short test history of
integrated circuits relative to discrete parts has not
made it possible to prove their inherent reliability. The
scatter chart of Fig. II-4 shows the relationship between
the amount of testing and computed reliability. Results
of 19 brute-force life tests were plotted, at the
90-per-cent confidence level, versus total unit hours of each
test. The apparent correlation indicates the dependence
of demonstrated reliability on test time. Indeed, the
integrated circuits which established the point at 5 x 10^7
hr, .005%/1000 hr, may well possess the .0015 inherent
reliability predicted for 1968, but some 90 million
additional unit-hours would have to be accumulated for
brute-force proof.
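
The dependence of demonstrated reliability on test time can be illustrated with the standard zero-failure bound for a constant failure rate. This is a sketch under assumptions not stated in the text: it supposes that the tests in question accumulated their unit-hours without failure and that the upper confidence bound on the rate is -ln(1 - confidence)/T. The function name is illustrative.

```python
import math


def unit_hours_for_zero_failure_demo(rate_pct_per_1000hr, confidence=0.90):
    """Failure-free unit-hours needed to demonstrate a failure rate.

    Assumes a constant failure rate; with zero failures in T unit-hours
    the upper confidence bound on the rate is -ln(1 - confidence) / T.
    """
    failures_per_hour = rate_pct_per_1000hr / 100 / 1000
    return -math.log(1 - confidence) / failures_per_hour


t_now = unit_hours_for_zero_failure_demo(0.005)    # demonstrated level
t_1968 = unit_hours_for_zero_failure_demo(0.0015)  # predicted 1968 level
print(f"now: {t_now:.2e} hr, 1968: {t_1968:.2e} hr, "
      f"additional: {t_1968 - t_now:.2e} hr")
```

Under these assumptions the additional test time comes out at roughly 10^8 unit-hours, the same order of magnitude as the figure quoted above.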

Reliability  comparisons  are  further  affected  by  the 
fact  that  an  integrated  circuit  of  ten  equivalent  parts 
will  not  directly  replace  ten  discrete  parts.  At  the 
present  time,  the  discrete  circuit  will  be  somewhat  more 
efficient  in  gain,  bandwidth,  noise  rejection  and  time-race 
immunity  than  its  integrated  counterpart.  Certain  modified 
hybrid  techniques,  using  miniature  discrete  transistors, 
overcome  these  obstacles  at  the  expense  of  the  as-yet-unknown 
reliability  differential  introduced  by  the  assembly  process. 

†Also see Ref. 22, p. 53; Ref. 23, p. 159; Ref. 24,
p. 211; and Ref. 25, p. 227.

8. PREDICTED PART COSTS

The  price  of  transistors,  diodes,  resistors,  and 
capacitors  is  presently  leveling  off  and,  in  some  cases, 
showing  a  slight  uptrend  due  in  part  to  decreased  demand. 
Improving  processes  and  yield  may  decrease  silicon  planar 
diode  and  transistor  prices. 

Dramatic  reductions  are  forecast  in  the  integrated 
circuit  area.  Some  reasonable  assumptions  must  be  made 
about  the  minimum  cost  of  the  package,  packaging  labor, 
and test labor or test equipment amortization. Table II-16
shows  the  predicted  price  chart  reflecting  a  composite  of 
industry  predictions  and  minimum  production  cost  estimates, 
even  at  high  yield  from  the  diffusion  processes. 

Table II-16

PREDICTED PRICE OF INTEGRATED CIRCUITS(a)


Equivalent No.              Price, $, 10,000 Quantities
of Parts                    1965      1968      1970

10 computer grade           10         5         2
10 high reliability         15         9         5
30 computer grade           20        10         4
30 high reliability         30        17         9
100 computer grade          40        20         8
100 high reliability        60        33        16

(a) Note these are MIL computer grade components, not the
so-called industrial-commercial grade units currently
offered at a few dollars each.

Improvements in Materials, Processes, and Quality Control

The  combined  results  of  life-test  and  physics-of- 
failure  studies  are  continuously  being  fed  back  into  the 
manufacturing  processes  of  competent  suppliers.  Needless 
to  say,  this  is  a  diminishing-returns  operation,  at  least 
where  price  is  some  sort  of  consideration. 

As it appears that more parts-per-chip is a legitimate
reliability objective for integrated circuits, improvements
are required in mask-making and registration processes, and
several seem to be on the way, as by-products of automatic
machine-tool control, precision photolithography, and
similar efforts.

Where the manufacturer does not have the impetus of a
generously funded, high-reliability program (Apollo, Polaris,
Minuteman), his ambitions may be split between capturing
some share of the coming industrial-commercial entertainment
market in silicon devices and becoming one of a select
group of qualified suppliers under high-reliability military
specifications. Usually, qualification is at the expense
of the buyer, and in some cases is quite expensive.
Included in the brute-force life test requirements of
MIL-R-38100A, for instance, is the following case:

Class Z, .0001%/1000 hr, 90 per cent confidence,
requires testing of 23,026,000 parts for 1000 hr
with no failures for qualification.

A user requiring a million half-watt composition
resistors qualified as above would have to pick up the
"reliability overhead" of 23 million additional resistors,
a test rack consuming 11.5 megawatts of power, and some
six man-years of before-and-after measurement (at one
second per resistor).
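
The overhead figures in this example follow from simple arithmetic, sketched below. Two inputs are assumptions rather than statements from the text: that each resistor is run at its full half-watt rating, and that a man-year is taken as 2000 working hours.

```python
PARTS_ON_TEST = 23_000_000    # resistors on life test, from the text
RATED_POWER_W = 0.5           # half-watt parts, assumed run at rated power
SECONDS_PER_MEASUREMENT = 1   # from the text
MEASUREMENTS_PER_PART = 2     # one before and one after the life test
HOURS_PER_MAN_YEAR = 2000     # assumption, not stated in the text

rack_power_mw = PARTS_ON_TEST * RATED_POWER_W / 1e6
labor_hours = (PARTS_ON_TEST * MEASUREMENTS_PER_PART
               * SECONDS_PER_MEASUREMENT / 3600)
labor_man_years = labor_hours / HOURS_PER_MAN_YEAR

print(f"test rack power: {rack_power_mw:.1f} MW")
print(f"measurement labor: {labor_man_years:.1f} man-years")
```

With these inputs the rack draws 11.5 MW and the measurement effort is a little over six man-years, consistent with the figures quoted above.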




Similar fascinating requirements are found scattered
profusely throughout the many high-reliability military
specifications. It is exactly this situation which lends
support to the physics-of-failure, quality-control oriented
approach to lower-cost, higher-inherent-reliability parts
production.

Possible  Breakthroughs 

Reduction to practice of thin-film active element
production, or demonstration of high reliability in hybrid
methods using discrete transistors could introduce an
important alternative approach to high-availability
ground-based computer production. The significant potential of
the hybrid is in the increased circuit efficiency due to
more stable resistor and capacitor values, freedom from
internal compromise, and higher net yield, as passive and
active components may be tested (and burned-in) separately
before assembly. The IBM System/360 is committed to a
hybrid (flip-chip) approach, and large quantity quotations
of $1.00-2.50 per flip-flop have been obtained from hybrid
suppliers for other systems.

Other possible approaches under development are
metal-oxide-silicon integrated circuits and field-effect
transistor circuits. It is doubtful that high-reliability
procurements initiated in 1965 should depend on the
to-be-demonstrated reliability of these devices.

9. CONCLUSIONS AND RECOMMENDATIONS ON PARTS


Semiconductor  Devices 


Transistors and diodes have three conveniently related
characteristics:

o  They  account  for  most  of  the  parts  cost  (84-97 
per  cent  in  the  examples  above); 

o  They  are  responsible  for  most  of  the  failures 
(viz.,  95  per  cent); 

o Reliability improvement by conventional means costs
the least percentage-wise (50-100 per cent for
transistors and diodes; 400-900 per cent for
resistors and capacitors).
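
The first two characteristics can be checked against the unit parts complement of Tables II-5 and II-8. The sketch below is illustrative only; the dictionary layout and function name are hypothetical, and the failure-share calculation uses the computer-grade failure rates.

```python
SEMICONDUCTORS = ("transistor", "diode")

# Cost extensions in dollars for a unit parts complement (Table II-8).
COST = {
    "as is": {"transistor": 30_000, "diode": 7_500, "comp. resistor": 600,
              "film resistor": 500, "mica capacitor": 200},
    "burn-in": {"transistor": 33_500, "diode": 9_750, "comp. resistor": 1_200,
                "film resistor": 750, "mica capacitor": 400},
    "hi-rel": {"transistor": 45_000, "diode": 15_000, "comp. resistor": 7_500,
               "film resistor": 2_500, "mica capacitor": 1_050},
}

# Quantities in a unit parts complement, and computer-grade failure
# rates in %/1000 hr (Table II-5).
QUANTITY = {"transistor": 10_000, "diode": 15_000, "comp. resistor": 15_000,
            "film resistor": 5_000, "mica capacitor": 5_000}
RATE = {"transistor": 0.01, "diode": 0.005, "comp. resistor": 0.0003,
        "film resistor": 0.0003, "mica capacitor": 0.0003}


def semiconductor_share(values):
    """Fraction of the total contributed by transistors and diodes."""
    return sum(values[p] for p in SEMICONDUCTORS) / sum(values.values())


cost_share = {policy: semiconductor_share(c) for policy, c in COST.items()}
failures = {part: QUANTITY[part] * RATE[part] for part in QUANTITY}
failure_share = semiconductor_share(failures)

for policy, share in cost_share.items():
    print(f"{policy}: semiconductors are {share:.0%} of parts cost")
print(f"semiconductors cause {failure_share:.0%} of failures")
```

The cost shares span roughly 84 to 97 per cent across the three procurement policies, and the failure share is about 96 per cent, in line with the figures quoted in the list above.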

Furthermore, the cost of parts is only a moderate
fraction of the cost of checked out, delivered, installed
computer systems complete with working programs.
Therefore, the following general requirements should be imposed
on semiconductor procurement:

o  Select  potential  suppliers  on  the  basis  of  proved 
reputation  in,  and  apparent  large-scale  commitment 
to,  the  high-reliability  market; 

o Require a two-stage 100 per cent screening process
on parts, plus mechanical and environmental testing
on a sampling basis, such as specified in
MIL-S-19500D. The overall process should be as shown in
the flowchart given in Fig. II-5.

The second-stage burn-in and test may be performed
by either the supplier or the user. The decision may well
be an economic one, although availability of results of
second-stage burn-in is an invaluable aid in selecting a
vendor. Obviously, any failures should be treated as rare
and valuable items and subjected to a carefully designed
post-mortem examination, usually by the device manufacturer.

Fig. II-5 -- Test process for semiconductor device procurement

Require  or  perform  some  form  of  long-term  life  testing 
on  samples  of  devices  taken  after  second-stage  screening,  to 

o Detect any changes in quality not caught by screening
and sampling tests;

o Uncover any new long-term inherent failure mechanisms;

o Be sure that burn-in is not accelerating any long-term
failure mechanisms.
All of these recommendations apply equally to integrated
circuit procurement. Although, on a "parts-only"
basis, it appears logical to insist on high-complexity
integrated circuit implementation, such a decision should
be subject to evaluation of the following:

o Effect on maintenance module size and cost, the
number and type of spares to be stocked, and the
equipment and training required for depot
maintenance;

o Effect on design cost and complexity and
interconnection reliability due to topological
restrictions and high interconnection density imposed by
small size;

o Actual price, availability and net (post-screening)
yield of high-complexity circuits at time of system
procurement.

For procurements initiated in calendar 1965, it appears
that reputable digital-systems manufacturers could present
valid cases for either discrete or monolithic silicon
integrated-circuit implementation. Also, it is possible that
manufacturers committed to, and skilled in, the modified
hybrid ("flip-chip," etc.) approach could compete if some
objective demonstration of the reliability of the
configuration can be presented.

Some recently published information [26,27] indicates
that the "best 1968" reliability levels for transistors
and diodes are achievable today, in organizations having
complete control, from the raw material procurement to
equipment installation and maintenance. It is interesting
to note that the system, Bell Electronic Switching
System No. 1 [8], is duplexed for reliability at the systems
level, has a target of 40 years mean time to dual failure
with one- to three-hour service time, and is constructed
with discrete parts on etched cards and wire-wrapped
backboards.

Dickenson [28] describes a hybrid thin-film approach
for a high-reliability space mission using triple voting
redundancy. Logic modules of complexity up to 14 parts are
used with an estimated individual failure rate of .03 to
.04%/1000 hr.

Fagg, et al. [29] and Davis, et al. [30] describe a
series of commercial computer implementations using
essentially the same hybrid approach.


Resistors 

As costs and reliability effects are negligible,
metal-film or tin oxide resistors should be used wherever the
narrowed degradation tolerances, and availability of a
larger selection of nominal values, permit higher circuit
efficiency. A decision to use composition resistors
elsewhere appears optimal, unless there is some effect on spare
parts costs. For all resistors, temperature-cycling
followed by 100 per cent inspection on a limit bridge is
recommended to catch gross genetic defectives.

Capacitors 

Conventional dipped-mica capacitors may be used, with
100 per cent voltage-temperature burn-in and limit-bridging.
High-reliability units might alternatively be selected,
but there seems to be more assurance (if only emotional) in
100 per cent inspection at the user's site, and the latter
may cost considerably less. Mylar capacitors should be
handled as above.


In-Plant  Parts  Handling 

Parts of initially excellent quality are exposed to
a high risk of traumatic experience along the path from
receiving inspection to equipment installation.
Innumerable examples of mistreatment, from a single resistor to
entire computer-logic sections, have been discovered. A
typical  quote  is  "Oh,  yeah,  old  serial  number  3  is  always 
a  pain  to  keep  on  the  air.  That's  because,  when  we  were 
checking  it  out,  a  regulator  went  out  in  the  lab  power 
supply  and  the  voltage  went  from  12  up  to  50.  There's 
still  a  spot  on  the  ceiling  where  one  of  the  electrolytics 
exploded."  But  old  number  3  was  nevertheless  shipped. 

The following recommendations should be imposed on
all system suppliers--and, in fact, any not already
incorporating most of them (particularly in integrated
circuit work) should be viewed with suspicion.

o  After  receiving  inspection,  store  parts  method¬ 
ically  in  a  kncwn  environment  and  in  a  manner 
which  prevents  physical  diimage.  Preferably,  parts 
are  stored  in  the  carriers  in  which  they  will 
later  be  delivered  to  the  assemblers. 

o  Assemble  under  scrupulously  controlled  conditions 
of  environment,  cleanliness,  and  operator  aptitudes, 
training,  and  integrity. 

o Devise test and checkout equipment and its
interconnections such that there is a vanishing
probability of part overstress through misuse or
malfunction. Select and indoctrinate test and checkout
personnel so as to minimize overstress, and to
ensure that it is reported if it occurs.

Circuit and system design will, of course, incorporate
adequate part derating factors in normal operation. Design
must also minimize probability of overstress in all failure
modes. Power supply regulator shorts are probably the
outstanding offenders, but every system contains numerous
disastrous possibilities which should be evaluated.
Protection against transients from power lines, during normal
and abnormal power turn-on and shut-down, must of course
be provided.

If  high  pointwise  availability  is  the  goal,  then, 
despite  the  possible  two-to-one  reliability  differential, 
it  is  better  to  have  the  equipment  on  continuously  than 
to  turn  it  off  and  on  as  few  as  two  times  daily,  assuming 
there  is  no  possibility  of  power  transients.  This  is 
principally  true  because  of  the  ever-present  chance  of 
over-voltage  surges  during  the  transient  period  and  the 
impossibility  of  performing  error  checks  when  the  machine 
is inoperative. If, on the other hand, the goal is to
maximize the "interval availability," namely, the fraction
of a specified interval of time that the machine is
operating, then there might be a stronger case for turning the
machine off when it is not needed.

Reliability  of  the  operating  environment  is  essential. 
Humidity  and  temperature  control  should  have  reliability 
comparable  to  that  of  the  computer,  and  of  course  must  be 
interlocked. 

With all the above precautions, the largest single
hazard to parts has yet to be mentioned: the maintenance
man. Nearly every authority, at some stage of the
discussion, introduces a statement such as "and once you get
it going, leave it alone. Don't let anybody into it with
clip leads or probes (etc.)." Assuming good design for
maintenance and good diagnostic equipment and procedures,
the best insurance for parts is to make sure that every
maintenance technician or engineer avoids unauthorized
procedures and reports every "accident" faithfully.

A final note on parts: All comments on screening,
burn-in, handling, and storage should apply, where
feasible, equally to spares.


Chapter III
THE  RELIABLE  COMPUTER 


1.  INTRODUCTION 

This chapter discusses the problems of designing and
building a single, reliable computer. The possibility of
achieving reliability by using more than one computer is
deferred until Chap. IV, but use of internal redundancy†
at various levels is discussed here.

The major influences on reliability are not the
schemes, such as redundancy, which are sometimes
implemented, but the fundamental methods by which reliable
parts are converted into an operating computer--for it
certainly isn't true that reliable parts insure a reliable
machine. For this reason, considerable space goes to a
philosophy of good circuit design.

The  process  of  logical  design  also  requires  attention-- 
errors  in  logical  design  or  construction  can  show  up  well 
after  the  computer  is  in  the  field,  with  serious  and  usually 
irreversible  consequences  for  military  operations. 

Much has been written in recent years about error
detection, and the subject is an important one. We have
chosen to expand the topic under the heading "failure
detection" wherein "errors" are part malfunctions which do
not produce immediate signal errors, as well as the usual
signal errors themselves. Familiar terms such as "error,"
"detection," "correction," etc., are carefully defined,
and methods of coping with these failures are discussed.

†The formal analysis of the redundant computer is
performed in Appendix A, and only selected results will
be given here.

This chapter concerns only those circuits which
protect against failures. Chapter VI discusses fault
diagnosis, isolation, and correction under program control.
Such programs, as will be seen, can powerfully affect
repair time and, hence, availability.

2. CIRCUIT DESIGN

As discussed in Chap. II, any part "failure"
definition other than passage to a limit value is arbitrary and
should be related to the intended application. Practical
circuits must be designed with some allowance for part
parameter variation with environment and time, and some
initial parameter range due to manufacturing variations.

Given a circuit configuration (schematic diagram) and
perfect knowledge of degradation behavior of the parts, it
is theoretically possible to determine circuit
reliability-versus-performance relationships for some stated design
life. For any but the most elementary configurations, the
mathematical task involved in determining such relations
becomes impressively complex.

The minimum circuit of interest contains three or
four inputs, one or two outputs, two or three supply
voltages, and ten to twenty parts. A first-order
characterization of a resistor or capacitor requires but a single
parameter. For a transistor or diode, even the simplest
mathematical model requires several parameters and some
mode of approximating nonlinear behavior in various
operating regions.

The engineer with a slide rule attempting to analyze
a circuit must be content with crude piecewise-linear
approximations and one-pole models of the active elements,
such as those of Ebers and Moll [1] or Beaufoy and Sparkes
[2]. The accuracy of this kind of work (relation to
observed circuit behavior) is ±10 per cent at best, and
-90, +500 per cent at worst, depending on circuit speed
and complexity.

Developments in the last decade dramatize the need
for "better tools," and the attempts to provide
computational relief and more accurate modeling show varying
degrees of utility and success [3-8]. With integrated
circuits, sophisticated models and high-powered
computation are needed, because of

o The inherently more "distributed" nature of
integrated circuits;

o  The  near-impossibility  of  "breadboarding"  and 
"laboratory  design"  by  successive  modifications. 

It  should  be  emphasized  that  the  above  work  relates 
to  analysis  only.  True  circuit  synthesis  by  automated 
means  is,  at  best,  still  in  the  "laboratory  curiosity" 
stage. 

The usual process of circuit synthesis proceeds
somewhat as follows:

o End-to-end system specifications are somehow
subdivided into circuit functions by cooperative
action of systems, logic, and circuit designers;

o The circuit designer takes a functional specification
for a circuit and--through some combination of
experience, intuition, and plagiarism (research)--
selects one or more tentative configurations and
sets of active elements;

o Using first-order (or zero-order, e.g., "educated
guess") approximations, a preliminary set of part
values is computed;

o  The  preliminary  configuration  is  analyzed,  using 
better  models  and  approximations,  if  possible; 

o  Shortcomings  are  corrected  by  parameter  change  or 
substitution  of  better  parts,  as  required; 

o The iterative process of analyze-modify is
continued until performance is satisfactory or hope
is abandoned and a new configuration is sought.

The necessary implements of the process are models
and computation. Models may be 1) lumped-constant
equivalent circuits, valid over certain specific parts of the
operating region, 2) nonlinear, distributed-constant
models of various degrees, or 3) true mathematical analog
representations. Computational aids fall into two general
categories: analysis and simulation. Analysis, of course,
is ideal up to the point at which complexity renders it
impractical. Simulation, both analog and digital, has been
utilized, with incremental-digital approaches undergoing
considerable investigation at this time.

Bogey  Design 

All  circuit  design  systems  require  some  guiding 
philosophy  which  defines  theoretical  circuit  failure  as 
a  function  of  theoretical  part  behavior.  The  earliest 
and  simplest  philosophy,  regrettably  still  followed  in 
some  organizations,  is  "bogey"  design. 

In its fundamental form, bogey design assumes that
all part parameters are at their new-nominal (as-labeled)
values, and will stay there forever. The only reasonable
defense that might be used by a thinking bogey-designer
is that positive and negative deviations are equally likely,
small deviations are more likely than large, and the effect
of many circuit topologies then gives some sort of
statistical protection. A slightly advanced form of bogey design
admits part variation over the manufacturing tolerance range
but no more. The effect of part selection by the supplier,
and the resultant rectangular distribution in as-delivered
parameters, is obvious.


Worst-Case  Design 

The wave of reaction to bogey design, started by the
obvious fall-off of reliability as systems become more
complex, led to the formation of various "worst-case"
design philosophies. The purest and simplest of these
(called "worst-worst-case" in some circles) operates as
follows:

Absolute end-of-life limits are provided for all
part parameters. The designer does not consider
whether these are really absolute or actually
some sort of statistical limit, but designs the
circuit so that it meets specifications with all
parameter values simultaneously at their most
unfavorable limits. This usually requires
selecting a different set of limits for analysis
with respect to each operating specification.

The  beauty  of  this  process  is  its  mathematical 
simplicity.  Sets  of  limits  are  selected,  usually 
by  inspection,  or  at  worst  by  a  few  coarse  trials; 
the  usually  few  part  parameters  contributing  to  the 
function  under  evaluation  are  then  determined  by 
solution  of  sets  of  simultaneous  inequalities.  If 
the  limits  are  properly  selected  and  if  there  are  no 
mathematical  errors,  the  circuits  always  work,  and 
continue  to  work. 
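
The corner evaluation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the circuit (an unloaded two-resistor voltage divider), the end-of-life limits, and the output specification are all hypothetical and do not come from this Memorandum. Because the divider output is monotonic in each parameter, checking every combination of limits is sufficient here; a circuit without that property would need a more careful search.

```python
from itertools import product


def divider_output(v_supply, r_top, r_bottom):
    """Output voltage of an unloaded two-resistor divider."""
    return v_supply * r_bottom / (r_top + r_bottom)


def worst_case_range(limits):
    """Evaluate the circuit with every parameter at each of its limits.

    `limits` maps a parameter name to its (low, high) end-of-life limits.
    Returns the smallest and largest output found over all corners.
    """
    names = list(limits)
    outputs = [
        divider_output(**dict(zip(names, corner)))
        for corner in product(*(limits[name] for name in names))
    ]
    return min(outputs), max(outputs)


# Hypothetical end-of-life limits: 12 V supply +/-5%, 10 kilohm resistors +/-20%.
LIMITS = {
    "v_supply": (11.4, 12.6),
    "r_top": (8_000.0, 12_000.0),
    "r_bottom": (8_000.0, 12_000.0),
}
SPEC = (4.0, 8.0)  # hypothetical output specification, volts

low, high = worst_case_range(LIMITS)
meets_spec = SPEC[0] <= low and high <= SPEC[1]
print(f"worst-case output {low:.2f} V to {high:.2f} V; meets spec: {meets_spec}")
```

With these invented numbers the output stays between 4.56 V and 7.56 V, so the specification is met at every corner; tightening the specification to 5 V to 7 V would fail the same test and force a change of parts or configuration.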

The major disadvantage here is the inefficiency of
over-design. Ultra-conservative worst-case protection
against degradation increases the number of parts in a
given system and the vulnerability to part failure of the
"limit" sort (open, short, anomalous degradation).

A reasonable compromise, now practiced by many
responsible organizations delivering excellent equipment,
is worst-case design with narrowed tolerances. They use
the same mathematical procedures but they perform limit
selection on a statistical basis to yield narrower spreads.

Another  approach  is  the  so-called  "Taylor  worst-case" 
design,  wherein  the  specifications  must  be  met  when  any 
one  part  is  at  its  most  unfavorable  extreme,  with  all  others 
either  at 

o  New-nominal,  or 

o  The  most  unfavorable  extremes  of  manufacturing 
tolerance. 

This technique is not often used because the
mathematical labor is excessive, and the compromise varies with
circuit configuration.
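
A minimal sketch of the first variant (one part at an extreme, all others at new-nominal) is given here. The circuit, an unloaded two-resistor voltage divider, and every number in it are hypothetical choices made for illustration only.

```python
def divider_output(v_supply, r_top, r_bottom):
    """Output voltage of an unloaded two-resistor divider."""
    return v_supply * r_bottom / (r_top + r_bottom)


# Hypothetical new-nominal values and most unfavorable extremes.
NOMINAL = {"v_supply": 12.0, "r_top": 10_000.0, "r_bottom": 10_000.0}
EXTREMES = {
    "v_supply": (11.4, 12.6),
    "r_top": (8_000.0, 12_000.0),
    "r_bottom": (8_000.0, 12_000.0),
}


def one_at_a_time_range(circuit, nominal, extremes):
    """Move one parameter at a time to each extreme, others at nominal."""
    outputs = []
    for name, bounds in extremes.items():
        for value in bounds:
            params = dict(nominal)  # every other part stays at new-nominal
            params[name] = value
            outputs.append(circuit(**params))
    return min(outputs), max(outputs)


taylor_low, taylor_high = one_at_a_time_range(divider_output, NOMINAL, EXTREMES)
print(f"one-at-a-time spread: {taylor_low:.2f} V to {taylor_high:.2f} V")
```

For these numbers the one-at-a-time spread is about 5.33 V to 6.67 V, against 4.56 V to 7.56 V when all three parameters sit at their unfavorable limits at once. The labor grows quickly in a real circuit because the evaluation must be repeated for each part and each operating specification.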

Statistical  Design 

Statistical design is the theoretically-ideal design
philosophy because it considers the behavior of the circuit
throughout the range of parameter variation, instead of just
at the limits of parameter values. The required information
is

o Distributions of part parameters with stress and time;

o Limits of stress and the design life;

o Required reliability of the individual circuit, i.e.,
probability of meeting specifications in the
environment, over the design life.

The design process, then, somehow yields reliable
parameter values or indicates that the requirement cannot be
met with the given parts and configuration. Ideally,
distributions of stresses would be given rather than limits,
and, as ranges of parameters would result in many cases,
an additional constraint might be introduced, such as
minimum power consumption. Obviously an engineer with a
slide rule cannot perform statistical design. Attempts
at aids to statistical analysis have produced isolated
results in man-computer circuit synthesis [5,9]. But at
present sheer complexity prevents the large-scale use of
statistical design.
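
The analysis half of this idea can be illustrated by random sampling, a technique chosen here for brevity and not one attributed to the references. The sketch below estimates the probability that a hypothetical unloaded voltage divider meets its output specification when each parameter is drawn from an assumed end-of-life distribution. The Gaussian distributions, their spreads, and both specifications are invented for the example.

```python
import random


def divider_output(v_supply, r_top, r_bottom):
    """Output voltage of an unloaded two-resistor divider."""
    return v_supply * r_bottom / (r_top + r_bottom)


def estimate_circuit_reliability(spec, trials=20_000, seed=1):
    """Fraction of sampled circuits whose output lies inside `spec`."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    low, high = spec
    passed = 0
    for _ in range(trials):
        # Hypothetical end-of-life distributions (mean, standard deviation).
        v_supply = rng.gauss(12.0, 0.2)
        r_top = rng.gauss(10_000.0, 500.0)
        r_bottom = rng.gauss(10_000.0, 500.0)
        if low <= divider_output(v_supply, r_top, r_bottom) <= high:
            passed += 1
    return passed / trials


loose = estimate_circuit_reliability((4.0, 8.0))
tight = estimate_circuit_reliability((5.8, 6.2))
print(f"P(meets 4.0-8.0 V) ~ {loose:.3f}; P(meets 5.8-6.2 V) ~ {tight:.3f}")
```

Such a calculation only evaluates a given set of part values. Statistical design in the full sense would also have to search for the values that achieve the required reliability, which is where the complexity noted above arises.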

Protection  Against  Part  Failure 

Design to protect against part failure, rather than
degradation, may exist at any level from circuit through
module or subassembly to the entire system. The philosophy
is generally referred to as redundancy at the selected
level. Two basic approaches to redundancy may be
classified as switching and paralleling. If the non-redundant
system is considered as a series chain of elements
(subsystems, circuits, parts) which fails when any element
fails, it may be represented as shown in Fig. III-1a.

Switching redundancy, in its simplest form, implies
the existence of a spare for each element and provision
for switching it into the system, as shown in Fig. III-1b.

There are two additional requirements:

o Means--either human or automatic--for detecting
failure and localizing to a particular element;

o Design the system and the switch such that time
required to switch does not affect system
operation.

[Fig. III-1 -- Redundant circuits: (a) non-redundant series
chain circuit; (b) switching redundancy; (c) parallel
redundancy]

If  several  of  the  elements  are  identical,  only  one  spare 
of  that  type  need  be  available,  provided  the  switch  is 
suitable. 

Figure III-1c shows parallel redundancy. Here the
circles represent interconnections designed such that
failure of an element in any possible mode cannot prevent
the associated parallel element from performing its
function.

Switching redundancy is sometimes referred to as
"standby" redundancy. The spares may or may not be
energized prior to use. For computing reliability of
elements subject to wearout failure under power (light
bulbs, electro-mechanical devices) this distinction is
extremely significant. For systems which are predominantly
semiconductors, the distinction is insignificant,
considering the cost of the added switching complexity to
apply power to a spare.

At the part level, switching redundancy is impractical,
as the reliability of a switch is, at best, similar to that
of a part. Parallel redundancy at the part level is
achieved by some sort of series-parallel connection of
single parts or "subcircuits" consisting of a few parts
each. Further discussion on circuit redundancy is deferred
to Sec. III-4.
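
The three arrangements of Fig. III-1 can be compared numerically with a simple sketch. It rests on assumptions that are not part of the text: element failures are independent, the interconnections of the parallel arrangement are perfect, and the detection-and-switching mechanism is lumped into a single hypothetical probability `switch_reliability` that it works when called upon. The element reliabilities are invented.

```python
from math import prod


def series_reliability(element_reliabilities):
    """Non-redundant chain: every element must work."""
    return prod(element_reliabilities)


def parallel_redundant_reliability(element_reliabilities):
    """Each element duplicated; a stage fails only if both copies fail."""
    return prod(1.0 - (1.0 - r) ** 2 for r in element_reliabilities)


def switching_redundant_reliability(element_reliabilities, switch_reliability):
    """Each element has a spare that is switched in after a failure.

    A stage works if the primary works, or if the primary fails and
    both the switch (with its failure detector) and the spare work.
    """
    return prod(
        r + (1.0 - r) * switch_reliability * r for r in element_reliabilities
    )


CHAIN = [0.95] * 10  # hypothetical: ten elements, each 95 per cent reliable

series = series_reliability(CHAIN)
parallel = parallel_redundant_reliability(CHAIN)
switched = switching_redundant_reliability(CHAIN, switch_reliability=0.90)
print(f"series {series:.3f}, switching {switched:.3f}, parallel {parallel:.3f}")
```

With these numbers the chain reliability rises from about 0.599 without redundancy to about 0.930 with switching and about 0.975 with paralleling. In this simple model a perfect switch makes the two redundant forms equal; the model deliberately leaves out the added parts, cost, and failure modes of the switch itself.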

Recommendations in the Procurement of Good Circuit Design

Several requirements for responsible, reliable circuit
design are listed below. These might be considered as
checkpoints on the integrity of design.

o  There  must  be  a  definitely  stated,  defensible 
design  philosophy. 

o  Rigorous  adherence  to  the  philosophy  is  essential, 
and  high-level  technical-administrative  approval 
should  be  required  for  allowance  of  any  exceptions 
or  discrepancies. 

Part-types  must  be  rationally  selected. 

o  Consider  present  and  predicted  availability  when 
establishing  design  tolerances. 

o Require considerable investigation before selecting,
and in determining limits for, "novel" parts. This
might include four-layer devices, unijunction
transistors, thermistors, field-effect transistors,
tunnel diodes, and multiaperture ferrite devices.

o Avoid specification of low-yield parts which may
become unavailable due to slight shifts in material
and process parameters (e.g., extremely high-gain
transistors).

o Avoid designs specifying parts requiring selection
in narrow ranges or close matching and tracking.

o Avoid dependence on uncontrolled part parameters
for circuit function (e.g., reverse recovery time
of coupling diodes for transistor turnoff).

Intensive effort must be devoted to anticipation and
evaluation of system-imposed requirements on circuit
specifications. Some examples follow:

o Provision for guarding against noise;

o Determination of true nature of interface signals,
rather than acceptance of superficial
specifications;

o Regardless of the design philosophy, recognizing
situations where worst-case reasoning must be
applied (as in output signals and noise from core
memory stacks, where any core may be interrogated
with any combination of overall information pattern
and previous history).

In human-slide rule systems, prefer worst-case design
with intelligent compromises in the selection of
end-of-life tolerance limits and mathematical models of the
components.

In human-computer systems, evaluate the extent and
nature of computer aids. Beware of systems under
development; insist on acceptable proof of operability of models
and computing techniques.

In any case, look for a stated design philosophy,
organized review procedures, and well-kept engineering
notebooks. Regardless of the excellence of the above
procedures, get results of lab verification of single
circuits and circuits operated in all reasonably
achievable combinations of loading, noise, layout,
interconnection, and stress.

3.  LOGICAL  DESIGN 

The logical designer interacts with the system
designer and the circuit designer to generate an
interconnection of logic elements which will

o Functionally implement the system specifications,
and

o Operate electrically when interconnected, without
violating established circuit limitations.

The logic designer's responsibility varies widely from
organization to organization. He may work directly from
system specifications, thus annexing the system design
function. He may generate actual diode network schematics,
thus encroaching on circuit design. Usually, though, the
logic designer writes a set of logical equations, or
generates a logic schematic diagram, subject to the
constraints of system design and circuit interconnection
complexity limits. In the most sophisticated systems,
the logic designer (after doing his own private
scratch-work in his favorite form) makes direct symbolic entries,
representing his equations, on a suitably human-engineered
keypunch form. Some subset of the following steps is then
executed, depending on the magnitude of the system.

1)  Keypunch  original  or  modifications  and  verify. 

2) Run computer routine to check for violation of
circuit limitations, and to perform various
consistency checks.

3)  Cycle  1  and  2  until  deck  passes. 

4)  Add  this  segment  to  logic  simulator. 

5)  Run  logic  simulator--results  go  to  logic  and 
system  designer  for  approval. 

6)  Cycle  1-5  until  buildable  portion  is  complete. 

7) Perform layout according to minimum-wire-length,
noise, and other rules. Check dynamic loading,
and list discrepancies for logic, system, or
circuit designer action.

8) Generate and check control tape for production
of backboard by automatic wire wrap or automatic
multilayer laminate method.

9)  Produce  by-products: 

a)  Bill  of  materials 

b)  Layout  diagrams 

c)  Logical  block  diagrams 

d)  Maintenance  manual. 

An adequate present-day system would include steps 1 and 2,
with the logic designer or a specialist producing the
layout and the computer producing wiring or interconnection
instructions. The complete system described obviously
provides maximum protection against human transcription
errors, wiring errors, and design errors with respect to
system and circuits.
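
As an illustration of the kind of routine meant in step 2, the sketch below checks a small logic description for one circuit limitation (maximum fan-out) and one consistency condition (every input must be defined somewhere). The netlist format, the signal names, and the fan-out limit are hypothetical; no actual design-automation program is being reproduced.

```python
from collections import Counter

# Hypothetical logic description: each gate output maps to its input signals.
NETLIST = {
    "n1": ["a", "b"],
    "n2": ["n1", "c"],
    "n3": ["n1", "a"],
    "n4": ["n1", "n2"],
    "out": ["n3", "n4", "d_undefined"],
}
PRIMARY_INPUTS = {"a", "b", "c"}
MAX_FAN_OUT = 2  # hypothetical limit on the loads one gate output may drive


def check_netlist(netlist, primary_inputs, max_fan_out):
    """Return (overloaded gate outputs, signals used but never defined)."""
    # Count how many gate inputs each signal drives.
    loads = Counter(src for inputs in netlist.values() for src in inputs)
    defined = set(netlist) | set(primary_inputs)
    overloaded = sorted(
        sig for sig, count in loads.items()
        if sig in netlist and count > max_fan_out
    )
    undefined = sorted(sig for sig in loads if sig not in defined)
    return overloaded, undefined


overloaded, undefined = check_netlist(NETLIST, PRIMARY_INPUTS, MAX_FAN_OUT)
print("fan-out violations:", overloaded)
print("undefined signals:", undefined)
```

Here gate `n1` drives three loads against a limit of two, and `d_undefined` is used without being defined, so the deck would be returned for correction before passing to the simulator.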

The value of added steps is difficult to estimate.
If designed specially for a production run of only a few
systems, the cost of logic simulation and optimum-layout
programs would be prohibitive. However, organizations
already possessing such programs may be writing them off
across large-volume production, with obvious gains in
value.

A complete design automation procedure has peripheral
value, in that programs may be run on a simulator (at a
price) in advance of system construction, and maintenance
and checkout procedures may likewise be established. An
estimated 80 per cent of the design-production errors in
all-human systems are the result of clerical and technical
mistakes, rather than conceptual design flaws. Design
and production automation reduces these.

Probably the first delivered unit of every large-scale
computer contains several logical, wiring,
circuit-design, or operating-program errors. The normal process
of checkout, delivery, and customer feedback on
high-production (100 or more) computers results in detection
and elimination by retrofit and field modification of all,
or very nearly all, such errors.

Such errors could, of course, be eliminated at the
source by exercising each computer by another (presumably
perfect) computer which simulates all possible modes of
operation. At any predictable computation rate, the time
required would, of course, be astronomical. However,
should a dual- or multi-computer approach be selected, it
is feasible to consider the "exhaustive exercise" as one
check to be carried out, over the years, at all
installations. Any tentative error which is reported could be
verified at a second location, before implementing
corrective procedures. The cost involves writing and
checking the simulator program, which could nearly double the
total programming cost.
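
A rough calculation suggests why the time is astronomical. The sketch below counts only the operand combinations of a single two-operand instruction and ignores internal machine state and instruction sequences, so it is a lower bound. The word lengths and the test rate of one million cases per second are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_years(word_bits, operands=2, tests_per_second=1e6):
    """Years needed to try every operand combination of one instruction."""
    cases = 2 ** (word_bits * operands)
    return cases / tests_per_second / SECONDS_PER_YEAR


for bits in (24, 32):
    print(f"{bits}-bit words: about {exhaustive_years(bits):,.0f} years")
```

Under these assumptions a 24-bit machine needs roughly nine years for one instruction, and a 32-bit machine several hundred thousand years, which is why the exhaustive exercise is conceivable only as a background check spread over many installations and many years.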

4.  CIRCUIT  REDUNDANCY 

A  circuit  is  said  to  be  n-fold  redundant  if  that 
circuit  is  replicated  n  times  and  the  output  of  the 
aggregate  is  taken  to  be  the  output  which  is  produced  by 
the  majority  of  the  circuits.  This  is  schematically 
illustrated  in  Fig.  III-2. 

To  prevent  a  tie,  n  is  assumed  always  to  be  odd.  A 
"circuit"  may,  in  theory,  consist  of  a  single  part  or  an 
entire  computer. 
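
The definition can be made concrete with a short sketch. It assumes one-bit circuit outputs, independent circuit failures, a failed circuit that always gives the wrong output, and a perfect voter; none of these assumptions is taken from the analysis in Appendix A, and the reliabilities used are invented.

```python
from math import comb


def majority_vote(outputs):
    """Return the bit produced by the majority of an odd number of circuits."""
    if len(outputs) % 2 == 0:
        raise ValueError("n must be odd to prevent a tie")
    return 1 if sum(outputs) > len(outputs) // 2 else 0


def majority_reliability(n, r):
    """Probability that more than half of n circuits are working.

    Each circuit works independently with probability r.
    """
    needed = n // 2 + 1
    return sum(
        comb(n, k) * r**k * (1.0 - r) ** (n - k) for k in range(needed, n + 1)
    )


print(majority_vote([1, 0, 1]))
for n in (1, 3, 5):
    print(n, round(majority_reliability(n, 0.9), 5))
```

With a circuit reliability of 0.9, three-fold redundancy gives 0.972 and five-fold gives about 0.991. The same formula shows that voting makes matters worse when the individual circuit is more likely failed than working: at 0.4, three-fold redundancy yields only 0.352.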

[Fig. III-2 -- Redundant circuits with majority voter]

In an n-fold redundant subsystem, no service is
performed until the entire redundant subsystem fails, i.e.,
when more than half of the circuits have failed. At that
time, service is performed and all n circuits are restored
to an operative condition.

Only results pertaining to the use of redundancy will
be presented here. Appendix A presents a complete
discussion and derivation of these results.

Let P(t) be the probability that the computer is on
at time t. First, consider P(t) as a function of computer
size and individual part failure rate: In Sec. II-7, the
unit part complement was defined and taken to be 18,334
equivalent transistors. A unit part complement of this
size implies the failure rates (for present-day parts)
shown in Table II-9. In addition to these failure rates,
others derive from either using the future predictions
of Table II-13, or assuming more unit part complements
per computing system.

In general, when service is possible, the transient
phase is not important and the asymptotic probability,
$P_\infty$, is the important measure of availability.†

Let $\mu$ = the service rate [$1/\mu$ = the mean time to
repair (MTTR)], and $N\lambda$ = total number of parts times the
part failure rate. $N = 18{,}334k$, where $k$ is the number of
unit part complements per system. Then $P_\infty$ as a function
of $N\lambda$ for the non-redundant computer is shown in Fig. A-12
for various service rates. With n-fold redundancy, let $M$
be the level of the redundancy, i.e., $M$ gives the number of
n-fold redundant modules in the computer. Figures A-16
to A-19 show $P_\infty$ for a system having three-fold redundancy
and $M = 1, 10, 10^2, 10^3$. For the same set of $\mu$'s and $M$'s,
the availability of a five-fold redundant computer is
shown in Figs. A-20 to A-23.
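
For the non-redundant case, a familiar closed form gives a feel for the quantities involved. The sketch below assumes the simplest two-state model, with a constant total failure rate (number of parts times part failure rate) and a constant service rate (the reciprocal of the mean time to repair), for which the asymptotic availability is the service rate divided by the sum of the two rates. Whether the curves of Appendix A rest on exactly this model is not established here, and the part failure rate used is a hypothetical value, not one taken from Table II-9.

```python
UNIT_PART_COMPLEMENT = 18_334  # equivalent transistors (Sec. II-7)


def asymptotic_availability(part_failure_rate, service_rate, k=1):
    """Long-run availability of a non-redundant, repairable computer.

    part_failure_rate: failures per part-hour (hypothetical input)
    service_rate: repairs per hour, i.e. 1 / MTTR
    k: number of unit part complements in the system
    """
    total_failure_rate = UNIT_PART_COMPLEMENT * k * part_failure_rate
    return service_rate / (service_rate + total_failure_rate)


# Hypothetical: 1e-7 failures per part-hour, one-hour mean time to repair.
one_unit = asymptotic_availability(1e-7, service_rate=1.0, k=1)
ten_units = asymptotic_availability(1e-7, service_rate=1.0, k=10)
print(f"k = 1: {one_unit:.4f}   k = 10: {ten_units:.4f}")
```

The sketch shows the expected trend: for a fixed service rate, availability falls as the number of unit part complements (and hence the total failure rate) grows.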

A few of these results are collected together and
shown in Figs. III-3 and III-4 with some pertinent
annotations on present and predicted performance.

†$P_\infty = \lim_{t \to \infty} P(t)$. See Appendix A for a
discussion of this point.



The transient phase for redundant computers is shown
here only for the no-service case; when service is
available, there are better methods than redundancy, as will be
shown in Chap. IV.

All the results presented in Appendix A will not be
duplicated here. The reader who wishes to pursue further
the subject of availability of a redundant computer should
find the values of failure rate and number of parts from
Chap. II (or use his own values), choose a service rate
and a level of redundancy M, then use the graphs in
Appendix A to find either $P_\infty$ (or P(t)) for some special
cases.

5.  FAILURE  DETECTION 

Clarifying  terminology  at  the  outset  requires  the 
following  definitions  and  discussion: 

Defect--An inadequacy in the logic, wiring, or
program of the machine as built. Some defects may
indeed be detectable by the procedures to be discussed.
Obvious cases will be noted, but the possible
existence of defects will in general be ignored.
(Section III-3 discusses prevention of logic and
wiring defects, and Sec. V-1 discusses
programming-coding defects.) Installation of an initially-failed
part would also constitute a defect, but exhaustive
module tests before assembly are assumed to preclude
this possibility.

Maintenance Modules--The set of smallest
field-replaceable elements. If a connector or an
interconnection fails (hopefully an extremely rare event),
it is probably repaired in the field, as it seems
unreasonable to define the entire backboard as a
maintenance module.

Fault--Abnormal electrical behavior of a maintenance
module. A fault may be the result of part failure
or unpredicted stress. A permanent fault results
from part failure or (improbably) from a permanent
occurrence of unpredicted stress. A transient fault
usually is the result of occurrence of an unpredicted
power, noise, or signal condition of short duration;
a less likely cause is occurrence of a short-term
unpredicted stress combination (temperature surge,
physical shock); a possible (though improbable) cause
is the excursion of a part parameter out of, then
back within, degradation limits.

Error--An instance of incorrect functional performance
by the system (e.g., least significant bit of
accumulator is always "one"; wrong result of division of
32,169 by 20,447; launched 39 interceptors at a
low-flying ptarmigan). An error is reproducible if it
always occurs when the proper circumstances are
established within the machine; an error is transient
if it occurs once and efforts to re-induce it fail.
Note that a transient fault, or a permanent fault
which is located and corrected, may or may not have
produced an error. Also, transient errors are
associated with transient faults, and reproducible
errors with permanent faults. As subsequently used,
"fault" alone refers to permanent faults.

Detection--The process of determining that a fault or
error has occurred. Also called recognition.

Correction--Applied to an error, removal of the error
before the erroneous information is used.

Location--The process of determining which module is
faulty, or which part has failed. Also called isolation.

Repair--Replacement of a faulty module or a failed part.

Service--The sequential execution of all of the above
five processes.

There are three levels at which the above five
functions may be performed (distinguishing fault detection from
error detection):

o The human level, presumed self-explanatory in all
five cases.

o The machine execution level--that is, in the process
of executing a stored program. (Also called the
"program" or "software" level.)

o The machine implementation level--that is, in the
design and physical construction (also called
"built-in," "automatic," or "hardware" level).

For  convenience,  the  adjectives  "programmed"  and 
"built-in"  refer  here  to  the  machine  execution  and  machine 
implementation  levels,  respectively.  What  follows  next  is 
a  detailed  examination  of  the  various  possible  models  of 
operating,  excluding  the  human  level. 

With respect to repair, "programmed repair" is, of
course, impossible; i.e., there is no executable computer
code which will cause the computer to repair itself. "Built-in
repair" is essentially synonymous with switching redundancy
(cf. p. 72).

"Error detection and correction" permits a system to
operate even after faults occur. Literature on the logic
and methods of detection and correction is extensive [10-14].
It is essential to determine the value of this form of
protection relative to the system application. The value,
in turn, is a function of the reliability increase, the
effects on operating speed, and the equipment cost of

o The amount of detection-correction provided and

o The effect of partitioning between programmed and
built-in methods.

Errors are found in information; faults are found in
equipment. Errors may be broadly classified as follows:

Transfer errors--Information which should be identical
at  two  points  separated  in  space-time  is  in  fact 
different.  Memory  transfer  errors  occur  in  reading 
from  storage,  or  less  often  on  writing  into  storage. 
Input-output  transfer  errors  occur  in  many  forms, 
depending  on  the  actual  equipment  and  transmission 
system. 

Operational errors--The result of some arithmetic
or logical operation is incorrect.

Control errors--That portion of the machine which
identifies and sequences operations performed improperly
(e.g., subtraction is performed instead of
addition; a branch is executed to the wrong instruction).

The  value  of  detecting  and  correcting  an  error  depends 
on the probability of its occurrence, the cost of protection,
and the consequence of letting the error exist. Some
systems (ballistic missile defense) are unusual in that
they  are  subjected  to  extremely  short  periods  of  peak 
demand  at  extremely  infrequent  intervals--perhaps  one 
peak  demand,  or  no  peak  demands  at  all,  may  occur  over 
the  entire  design  life.  The  consequences  of  an  error  during 
a  peak  demand  period,  however,  are  extremely  severe. 

A  very  simple  model  may  give  some  insight  into  the 
values  involved.  Assume  peak  demands  of  up  to  100  seconds 
duration,  occurring  once  in  ten  years.  Assume  system 
downtimes  of  1,  10,  or  100  hr  per  year.  (For  ten-hour 
repair  time  this  corresponds  to  MTBF  of  87,600,  8760,  and 
876  hr,  respectively).  Further  assume  transient  error 
rates  of  10  or  100  per  year  (36  days  and  3.6  days  mean 
error-free  time,  respectively).  Figure  III-5  shows  the 
relative  risk  as  a  function  of  demand  duration,  for  various 
combinations  of  downtime  and  error  rate.  Risk  was  computed 
assuming  transient  error  duration  is  negligible  relative 
to  demand  duration,  and  uptime  periods  are  large  relative 
to demand duration, giving

$$\text{Risk} = \frac{ET + 3600D}{10 \times 365 \times 24 \times 3600},$$

where E = number of errors per year, D = number of downtime
hours per year, and T = duration of peak demand in
seconds.
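The simple model above can be evaluated directly. The following is a minimal sketch, not the computation behind Figure III-5; the function names are hypothetical. It assumes the risk expression as stated (one peak demand in ten years, a 365-day year) and, for the MTBF figures, the ten-hour repair time mentioned in the text.

```python
HOURS_PER_YEAR = 365 * 24
SECONDS_PER_YEAR = HOURS_PER_YEAR * 3600
YEARS_BETWEEN_DEMANDS = 10  # one peak demand in ten years, as assumed in the text


def relative_risk(errors_per_year, downtime_hr_per_year, demand_seconds):
    """Risk = (E*T + 3600*D) / (10 * 365 * 24 * 3600)."""
    numerator = errors_per_year * demand_seconds + 3600 * downtime_hr_per_year
    return numerator / (YEARS_BETWEEN_DEMANDS * SECONDS_PER_YEAR)


def mtbf_hours(downtime_hr_per_year, repair_hr=10):
    """MTBF implied by a yearly downtime when every repair takes repair_hr."""
    return HOURS_PER_YEAR * repair_hr / downtime_hr_per_year


if __name__ == "__main__":
    # Tabulate the six combinations of downtime and error rate for T = 100 s.
    for downtime in (1, 10, 100):
        for errors in (10, 100):
            risk = relative_risk(errors, downtime, 100)
            print(f"D={downtime:>3} hr/yr  E={errors:>3}/yr  risk={risk:.2e}")
```

With D = 1 hr per year the downtime term alone is about 1.1 x 10^-5, so 100 transient errors per year over a 100-second demand (about 3.2 x 10^-5) dominate; with D = 100 hr per year the same transient term is small by comparison.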

The curves indicate that, if risk values of the order
of $10^{-3}$ are adequate, transient error effects are negligible.
But with risk values approaching 10 ^ , transient errors
are relatively significant. Transient error rates are
difficult to predict. It is feasible and mandatory
to provide all reasonable system and circuit protections
against transient errors resulting from power line behavior,
electromagnetic interference, and the devices
involved in information transfer.

Fig. III-5 -- Probability of failure during peak demand period
(ordinate: probability of failure during peak demand; abscissa: peak demand duration, seconds)

The  remaining  actions  that  may  be  taken  are  error 
detection  and  logging,  and  error  correction.  Detecting 
transfer  errors  is  a  relatively  simple  process,  usually 
accomplished  by  parity  checking.  Control  and  operational 
errors  are  not  so  easily  detected,  because  the  nature  of 
decoding,  arithmetic,  and  logical  operations  is  such  as 
to  destroy  simple  internal  relations,  such  as  parity. 
Fortuitous exceptions are numerous, and it is reasonable
to consider error detection in these cases, where the
added cost is justified. Examples are:

o  Forbidden  operation  code  detection; 

o  Forbidden  digit  checks; 

o  One-only  checks  on  decoding  matrix  outputs. 
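The three checks listed above are simple enough to sketch in a few lines. This is an illustration only: in the machine these would be built-in circuits, the operation-code set is hypothetical, and the digit check assumes 4-bit binary-coded decimal digits, which the text does not specify.

```python
def is_forbidden_opcode(opcode, valid_opcodes):
    """Forbidden operation code: any code the machine does not define."""
    return opcode not in valid_opcodes


def has_forbidden_digit(bcd_digits):
    """Forbidden digit: a 4-bit decimal digit may only hold 0 through 9."""
    return any(digit > 9 for digit in bcd_digits)


def one_only(decoder_outputs):
    """One-only check: exactly one decoding-matrix output line is active."""
    return sum(1 for line in decoder_outputs if line) == 1


if __name__ == "__main__":
    valid = {0o01, 0o02, 0o40}               # hypothetical instruction set
    print(is_forbidden_opcode(0o77, valid))   # undefined code is flagged
    print(has_forbidden_digit([3, 12]))       # 12 is not a decimal digit
    print(one_only([0, 1, 1, 0]))             # two lines active: check fails
```

Each check costs little because it tests a relation that a correct machine preserves for free; none of them detects an error that happens to produce another legal value.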

Redundancy  in  arithmetic,  logical,  and  control  equip¬ 
ment,  of  course,  introduces  error  protection,  but  at  a 
parts  cost  at  least  proportional  to  the  order  of  redundancy. 

Built-in parity checking requires addition of one bit
to all transfer paths and storage locations. Also, the
added circuits to generate and check the parity are required.
For a reasonably large machine (3000 flip-flops,
25,000 gates, 36-bit parallel operations) parity checking
and generation should not add more than 3 per cent to the
parts complement.* Programmed parity checking is possible,
but still effectively requires the extra bit of storage
in that one bit must be redundantly "wasted" in each word
which is checked. Also, this is a time-consuming process
which might introduce a hidden cost in the form of increased
overall speed requirement.

* Methods of implementing parity checks are discussed
in Refs. [12-14].
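To make the cost of programmed parity checking concrete, the sketch below packs 35 data bits and one parity bit into a 36-bit word. The word layout (parity in the top bit) and the choice of even parity are assumptions made for illustration; the function names are hypothetical.

```python
WORD_BITS = 36
DATA_BITS = WORD_BITS - 1            # one bit of every word is given up to parity
DATA_MASK = (1 << DATA_BITS) - 1
WORD_MASK = (1 << WORD_BITS) - 1


def parity_of(value):
    """Return 1 if value has an odd number of one bits, else 0."""
    return bin(value).count("1") & 1


def attach_parity(data):
    """Store an even-parity bit in the top bit of the 36-bit word."""
    data &= DATA_MASK
    return (parity_of(data) << DATA_BITS) | data


def check_parity(word):
    """True if the whole 36-bit word still has even parity."""
    return parity_of(word & WORD_MASK) == 0


if __name__ == "__main__":
    word = attach_parity(0b1011)
    print(check_parity(word))             # unchanged word passes
    print(check_parity(word ^ (1 << 7)))  # a single flipped bit is detected
    print(check_parity(word ^ 0b11))      # a double error passes unnoticed
```

The generate and check steps must be executed for every protected word, which is the hidden speed cost noted above; the last line also shows that simple parity detects only an odd number of bit errors.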

In a single, series-chain (non-redundant) system, the
utility of single-transfer error detection differs markedly
for destructive and non-destructive transfer. In destructive
transfer, information in a register (usually a set of
magnetic cores) transfers to a second register, and the
first register simultaneously clears. If an error occurs
in the transfer, the information is irrevocably incorrect,
even though the error was transient. In non-destructive
transfer, the content of the first (sending) register
remains, at least until the receiving register completes
error checking. In this case retransmission can correct
a transfer error.
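The non-destructive case can be sketched as a check-and-retry loop. This is a schematic illustration under stated assumptions: the word is taken to carry even parity, the channel is modeled as a function that may corrupt the word, and the retry limit and all names are hypothetical.

```python
def parity_ok(word):
    """Even-parity check on the received word."""
    return bin(word).count("1") % 2 == 0


def transfer_with_retry(sender_word, channel, max_attempts=3):
    """Non-destructive transfer: the sender keeps its copy until the check passes."""
    for attempt in range(1, max_attempts + 1):
        received = channel(sender_word)
        if parity_ok(received):
            return received, attempt
    # A persistent failure is no longer a transient error.
    raise RuntimeError("transfer error persisted; treat as a fault")


def make_flaky_channel():
    """Channel model that flips one bit on the first transfer only."""
    calls = {"count": 0}

    def channel(word):
        calls["count"] += 1
        return word ^ 1 if calls["count"] == 1 else word

    return channel


if __name__ == "__main__":
    print(transfer_with_retry(0b1100, make_flaky_channel()))
```

In a destructive transfer there is no retained copy to resend, so the same detection can only report the loss, not repair it.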

In  dual  systems  that  operate  in  synchronism,  detec¬ 
tion  of  an  error  by  one  machine  can  be  used  to  switch  the 
non-erratic  machine  on-line.  Multiple  systems  provide 
many  other  alternatives. 

The correction of single transfer errors by built-in
means requires the addition of considerable equipment.
Six additional bits must be generated and carried for
correction of a single error in a 36-bit data word. Depending
on the number of transfer paths accommodated, up
to 50 per cent additional circuits might be required.
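The figure of six additional bits agrees with the standard condition for a single-error-correcting code of the Hamming type, which requires r check bits such that 2^r >= m + r + 1 for m data bits. The text does not name the code it has in mind, so the choice of this bound is an assumption; the function name is hypothetical.

```python
def check_bits_for_single_error_correction(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1.

    The r check bits must be able to name every possible single-error
    position (data_bits + r of them) plus the "no error" case.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


if __name__ == "__main__":
    print(check_bits_for_single_error_correction(36))
```

For a 36-bit word, five check bits give only 32 distinguishable cases against the 42 needed, while six give 64, which is why the overhead is six bits rather than the single bit needed for detection alone.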

For a single non-redundant computer of about 100,000
equivalent-transistor complexity, it is unlikely that
MTBF of more than 1000 hr could be achieved. For repair
times longer than one hour, the down time component of
risk would be greater than $10^{-3}$. This apparently makes
error detection and correction unnecessary. If dual- or
multi-computer redundancy is provided, error correction
is unnecessary, as simple detection and switching will be
adequate. Only if some internally redundant approach is
selected (e.g., triple-voting at subsystem or circuit
level; "quadding") which puts the risk below $10^{-4}$ should
transfer error correction be considered. Even then, it
appears, the addition of correction circuits to the
internal redundancy would result in greater cost and an
aesthetically less-desirable system than a dual- or
multi-computer.

There  is  an  important  further  consideration  relative 
to  error  detection,  even  for  single  computers  in  the  under- 
1000-hr  MTBF  range.  At  times  other  than  those  of  peak 
demand,  errors  of  commission  could  occur.  The  effects  of 
these  might  range  from  embarrassing  (generation  of  spurious 
alerts)  to  frightening  (arming  of  an  interceptor).  Simple 
parity  checking  on  transfers,  plus  the  other  "easy"  checks, 
provide inexpensive insurance against such errors. Further,
as transient error rates are unpredictable, detection
permits  tallying  and  indication  of  situations  where  tran¬ 
sient  rates  are  increasing,  indicating  impending  faults. 
Furthermore,  all  error  checks  are  valuable  in  the  process 
of  fault  detection  and  location,  and  probably  more  than 
just  the  "easiest"  will  be  included  for  this  purpose. 

Programmed  error  correction  may  be  performed,  but 
extra  bits  must  be  added  to  storage  and  registers  (for 
the same precision), space must be added to storage for
the program itself, and, for peak-demand situations, the
speed must be increased to accommodate the extra program
steps.

Chapter IV

MULTIPLE  COMPUTERS  FOR  RELIABILITY 

1.  INTRODUCTION 

One  method  of  obtaining  reliability  (for  a  price)  is 
to  buy  extra  computers.  Then,  quite  simply,  when  one 
fails,  a  spare  takes  its  place  (if  the  spare  is  working). 

If  the  probability  of  at  least  one  machine  being  avail¬ 
able  is  high  enough,  and  if  there  are  ways  of  detecting 
failures  so  that  repair  can  be  initiated,  the  multiple 
computer  concept  becomes  very  attractive.  The  following 
examples  use  several  computers  where  the  actual  problem 
requires  but  one:  the  extra  machines  provide  back-up. 

Several  distinctions  must  first  be  made.  First, 
there  is  the  difference  between  "on-line"  and  "non-on-line" 
operation.  On-line  means  that  all  the  computers  are 
operating;  i.e.,  all  the  reserve  machines  are  in  the  same 
environment  as  the  one  which  is  doing  the  work.  This  is 
unlike  most  "spare  parts"  situations  where  the  spares  are 
on the shelf (hence, not subject to wearout). In the following
analysis, the on-line situation is assumed. This
case  is  selected  primarily  because  the  increased  relia¬ 
bility  resulting  from  the  off-line  condition  does  not  off¬ 
set  the  continuous  error-checking  ability  available  in  the 
on-line  condition;  and  also,  it  is  unlikely  that  anybody 
would  allow  a  large  and  expensive  digital  computer  to  remain 
idle. 

Second is the question of how service is apportioned.
It might be assumed that just a fixed-service capability
exists and the computers are repaired sequentially if more
than one fails. On the other hand, a flexible amount of
service can be assumed wherein all the required repairs are
performed in parallel; both cases will be examined here.
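The difference between the two service assumptions can be illustrated with the textbook steady-state model for a duplex pair. This sketch is not the Appendix A derivation: it assumes exponentially distributed failure and repair times, both machines on-line with the same failure rate lam and repair rate mu, and perfect error detection and switchover. The function names are hypothetical.

```python
def single_availability(lam, mu):
    """One repairable machine: fraction of time it is up."""
    return mu / (lam + mu)


def duplex_parallel_repair(lam, mu):
    """Flexible service: each failed machine has its own repairman.

    The machines are then independent, and the system is down only
    when both are down at once.
    """
    unavailability = lam / (lam + mu)
    return 1.0 - unavailability ** 2


def duplex_sequential_repair(lam, mu):
    """Fixed service: one repairman, failed machines repaired one at a time.

    States are 0, 1, 2 machines failed, with failure rates 2*lam and lam
    and repair rate mu out of states 1 and 2. With rho = lam / mu the
    balance equations give P2 = 2*rho**2 / (1 + 2*rho + 2*rho**2).
    """
    rho = lam / mu
    return (1 + 2 * rho) / (1 + 2 * rho + 2 * rho ** 2)


if __name__ == "__main__":
    lam, mu = 0.001, 0.1   # illustrative rates per hour, not taken from the report
    print(single_availability(lam, mu))
    print(duplex_sequential_repair(lam, mu))
    print(duplex_parallel_repair(lam, mu))
```

When repair is fast relative to failure the two duplex cases differ only in the second-order term, roughly 2*rho**2 against rho**2 of unavailability; the distinction matters most at low service rates.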

Consistent  with  the  all  on-line  mode  of  operation, 
adequate  error  detection  will  be  assumed.  The  case  of 
three  or  more  computers  whose  outputs  are  majority-voted 
(see  Sec.  A-13)  will  not  be  treated  here,  since  the  duplex 
(two  computers)  method,  each  with  sufficient  self-checking, 
is  far  superior. 

As in Chap. III, only the pertinent results of Appendix
A will be given. The primary problem is to ascertain
the asymptotic availability of the duplex and multi-processor
systems and to compare the results with the redundant
and non-redundant single computers.

2. THE MULTI-PROCESSOR

The  multi-processor  is  a  way  of  building  very  large, 
very  fast,  computing  systems  for  use  in  solving  problems 
which,  although  very  large,  need  not  be  processed  se¬ 
quentially.  That  is,  certain  parts  of  the  problem  can 
be  worked  on  at  the  same  time;  then,  perhaps,  the  results 
merged  and  again  more  processing  done  in  parallel. 

To  accomplish  this  feat,  the  notion  of  a  single, 
complete  computer  is  abandoned.  Instead  we  take  a  number 
of  memory  units,  another  collection  of  arithmetic  units, 
some  control  units,  and  enough  input/output  machinery.  If 
at  least  one  of  each  of  these  units  is  connected  together, 
a  single  computer  will  result.  Assume  now  that  there  is 
enough switching equipment to connect enough units together
so as to form m separate computers. Also, let
there be additional switching to interconnect these m
computers so that they may communicate with each other
under program control (see Fig. A-28). This is what is
currently called a multi-processor. A further discussion
of such machines may be found in Ref. 1.

As usual, the analysis appears in Appendix A-14, and
the concern here is to compare the multi-processor with
other systems.

3.  SYSTEM  COMPARISONS 

The  results  derived  in  Chaps.  II  and  III,  and  Appendix 
A  permit  some  comparisons.  It  is  impossible  to  compare 
all  variations  of  the  systems  which  have  been  discussed, 
but  the  reader  can  make  his  own  comparisons  for  cases  of 
particular  interest. 

Figures IV-1 to IV-4 compare the asymptotic availability
of five data processors as the service rate is
varied. These graphs are virtually self-explanatory;
in all cases, the multi-processor provides the highest
availability and a duplex arrangement the next highest.
The exception to this statement is for very low service
rates (Fig. IV-4, μ = .025) where the availability is
higher for duplex and non-redundant machines which are
very large [$\log_{10}(NX \times 10^{6}) > 4.4$]. As might be expected,
redundant systems are never serious contenders, if service
is available.

The proper choice of NX must again be made by the
reader, but some typical systems are listed in Table IV-1.
Judicious use of Tables II-5 to II-13 allows a sufficiently
accurate estimate of NX for any proposed system.

Fig. IV-4 -- Comparison of systems (μ = .025)

The multi-processor has its drawbacks, however. The
problem of switching has been alluded to, but not treated
here.  The  reader,  when  choosing  a  value  of  N  for  a  par¬ 
ticular  multi-processor,  would  be  well-advised  to  add  as 
much  as  25  per  cent  to  his  figure  to  allow  for  the  added 
complexity  in  switching.  We  again  remind  the  reader  that 
the  potential  of  the  multi-processor  is  realized  only  for 
problems  which  can  be  partitioned  into  sub-problems,  each 
capable  of  being  worked  on  at  the  same  time.  Without  this 
feature  of  parallelism,  the  beauty  of  the  multi-processor 
is  lost. 

The  problems  of  programming  the  multi-processor  are 
also  much  more  complicated  than  those  of  the  more  conven¬ 
tional  machines.  This  subject  will  receive  considerable 
attention  in  the  next  decade. 

Table IV-1

SIZE-RELIABILITY FACTOR FOR VARIOUS PART GRADES

Parts                                                 NX

Present computer grade                        10      1.83 x 10^-1
Present computer grade                         1      1.83 x 10^-2
Present computer grade with burn-in           10       .92 x 10^-1
Present computer grade with burn-in            1       .92 x 10^-2
Present high reliability                      10       .55 x 10^-1
Present high reliability                       1       .55 x 10^-2
Predicted 1968 computer grade with burn-in     1      1.83 x 10^-4

Chapter V

SYSTEMS  CONSIDERATIONS 


1.  INTRODUCTION 

This  chapter  considers  a  variety  of  topics  concern¬ 
ing  how  the  data  processor  is  matched  to  its  environment. 
Presumably,  we  now  possess  a  very  reliable  computing  machine, 
and  the  task  at  hand  is  to  use  it  properly.  Programming, 
interconnection  and  packaging,  extreme  environments,  and 
quality  control  are  some  of  the  remaining  problems.  This 
list  is  surely  incomplete--every  systems  man  will  have 
his  own  list  of  troublesome  thorns.  The  maintenance 
process  (which  also  includes  error  diagnostic  techniques) 
is so important that a separate chapter will be devoted
to it.

2. PROGRAMMING

How to write a correct program remains a difficult
problem--and one for which no formula can be given. Programs
are always reliable in the usual sense of the word,
but large ones usually contain undetected errors--undetected,
that is, until sometime later a particular set
of  circumstances  takes  the  program  into  an  untried  branch 
which  contains  an  error,  with  possibly  catastrophic  results. 

"Catastrophic"  is  used  here  in  the  sense  of  some 
important, irreversible event which results from a programming
error. Most errors do not result in such occurrences,
but real-time control processes such as satellite guidance,
the control of a steel mill, or ballistic missile defense
are situations where errors may, indeed, be catastrophic.

Unfortunately, although very large programs can be
thoroughly tested, the number of tests which are required
for an exhaustive checkout is astronomical and prohibitive.
In view of this, the best that can be done is to set down
a set of procedures which, if faithfully followed, will
significantly reduce programming errors [1,2].*

Errors are usually divided into two classes: "coding
errors" and "logical errors." Coding errors are those that
might better be called clerical errors; the keypunch operator
hits the wrong key, a card is out of order, etc.
Everybody commits these errors, but fortunately very few
of them go undetected. If the program is written in an
assembly language (e.g., FAP or MAP),† it will rarely
assemble and execute in the face of a coding error. On
the other hand, if a compiler is used (e.g., FORTRAN or
ALGOL), these errors are a little easier to make.‡ But
coding errors which will still permit the program to
compile and execute are still infrequent.

* For the reader who wishes to pursue this subject
beyond what is given here, an excellent starting point
is Hosier [2].

† For the definition of an assembly program and compiler,
see Haverty and Patrick [3].

‡ For example, in FORTRAN, the following error went
undetected for six months:

      BO = C + D
        .
        .
      E = B0 + F

The zero was used in the last line instead of the letter O.
FORTRAN defines B0 as zero and executes.

Much more troublesome and more frequent are the
logical errors, primarily because they must be discovered
(in cases where they are not glaringly obvious) by the
programmer's own intimate knowledge of what the program
is supposed to do. Logical errors result from the programmer's
misreading or misunderstanding the programming
design specification; for example, adding two numbers
when they should have been subtracted, or taking a particular
branch when A < B instead of A ≤ B. Just as
coding errors are perhaps easier to commit in a compiler
language, so logical errors are easier to make in the
assembly language. All in all, however, fewer errors will
be made if a compiler is used instead of writing directly
in the assembly language.

As for the cure, very few substantive statements can
be made. "Be careful!" and "Double check!" just aren't
sufficient. Hopefully, the items listed below will add a
little toward acquiring error-free programs.

Techniques  for  Avoiding  Errors 

Documentation. Complete specifications of what the
program is supposed to do (Program Performance Specification)
and how the program is to accomplish this goal
(Program Design Specification) should be mandatory.
Furthermore, enough flowcharts and subsidiary documentation
must be written so that bringing in a new programmer
is not an impossible task. Changing programmers,
or giving one programmer another's code, is a bad situation
at best and should be avoided when at all possible.
Here, also, the use of a compiler makes a substantial
difference; the effort required to successfully change programmers
with just an assembly language can be prohibitively
large.

Interface  Communication.  Means  must  be  available  by 
which  the  routines  written  by  different  programmers  can  be 
mated.  This  is  usually  done  as  the  system  grows,  but  some 
advanced planning could avoid a lot of the potential mismatch.
Some of this is just a memo circulation procedure,
but much more important might be the use of common data file
compiling systems, such as the CL-1 Programming System and
COMPOOL [4-6]. In addition to providing centralized control
of the data file, these systems remove the artificial
boundaries which are brought about by fixed-word-length
machines. A programmer may now, for instance, extend the
data stored in the first eight bits of a certain word to
twelve bits and be assured that no trouble will ensue because
that data word was already full.

Subroutine  and  System  Debugging.  Make  sure  that  every 
subroutine  and  subsystem  operates  in  its  own  right.  This 
is done by writing enough extra code to check the subroutine
without attaching it to a larger subsystem or the
main  program. 

Use of Acceptance Specifications. Every subroutine
and larger subsystem should be accepted only after having
passed an already-written acceptance specification.
This specification must be written on the basis of the
performance specification and not the design specification;
basing it on the design specification is a very common
failing and has resulted in many errors going undetected.
Again, this must be done at the subroutine level as well
as higher in the system.

Simulation. Simulation techniques can, and many times
should, be used extensively to aid in debugging
and locating logical errors. Simulation can be performed
at many levels--what is meant here is really a type of
"micro" or "internal" simulation of other subroutines or
subsystems which will enable a more thorough check of the
program undergoing test. The next section will cover the
topic of general input simulation (i.e., simulation of the
real-world environment in which the system will find itself).

3.  PROGRAMMED  ERROR  DETECTION 

One  might  now  ask  whether  or  not  subsidiary  programs 
can  be  written  which  are  used  solely  for  error  checking. 
More  specifically,  are  there  programs  which  a)  detect 
hardware  faults,  or  b)  check  the  correctness  of  the 
operational  program? 

The  answer  to  (a)  is  probably  no;  there  is  no  strictly 
programmed  method  which  will  check  the  machine.  The  usual 
procedure  is  to  run  a  large  set  of  problems  whose  solutions 
are  known,  and  then  automatically  compare  the  machine's 
solutions with the correct ones. Technicians run these
programs during the morning checkout-and-preventive-maintenance
period of most large data processors.

Another,  more  advanced  way  in  which  the  program  can 
aid  in  error  checking  is  to  have  a  collection  of  checking 
circuits  which  are  switched  to  different  parts  of  the 
system  under  program  control,  thus  reducing  the  total 
amount of checking circuitry required. This method has
been utilized in the No. 1 Electronic Switching System [7].

Much less can be done in writing programs which check
other programs. By this is meant the execution of a set of
instructions whose purpose it is to verify the correctness
of another set of instructions. It appears to be a difficult
and relatively uninvestigated area, and most opinions are not
very optimistic about its future.

A  simulation  scheme  which  exercises  the  entire  pro¬ 
gram  is  a  possibility.  Such  simulations  are,  themselves, 
large  programs  and  bring  with  them  the  problems  of  knowing 
when  the  reactions  of  the  operational  program  are  wrong. 
Obviously,  the  correct  value  of  every  variable  cannot  be 
supplied  for  every  input,  particularly  when  the  simulation 
inputs  are  generated  by  some  random  process.  "Reasonable" 
bounds  could  be  provided,  however,  and  the  program  scrutin¬ 
ized  if  any  variable  fell  outside  its  bounds. 
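The "reasonable bounds" idea amounts to a range check applied to the operational program's variables after each simulated input. The sketch below is a minimal illustration; the variable names and limits are hypothetical and stand for whatever quantities the operational program exposes.

```python
def out_of_bounds(variables, bounds):
    """Return the names of variables lying outside their [low, high] bounds."""
    violations = []
    for name, (low, high) in bounds.items():
        value = variables[name]
        if not (low <= value <= high):
            violations.append(name)
    return violations


if __name__ == "__main__":
    # Hypothetical limits supplied by the analyst, not by the program under test.
    bounds = {"range_km": (0.0, 500.0), "interceptors": (0, 4)}
    state = {"range_km": 120.0, "interceptors": 39}
    print(out_of_bounds(state, bounds))   # the interceptor count needs scrutiny
```

A variable inside its bounds is not thereby shown to be correct; the check only singles out reactions gross enough to warrant scrutiny, which is all that can be asked when the inputs are random.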

4.  INTERCONNECTION  AND  PACKAGING  RELIABILITY 

Interconnection  may  be  roughly  classified  in  order 
of  increasing  complexity  of  the  disconnection  process. 

A  first  general  category  includes  connections  in¬ 
tentionally  designed  for  occasional  or  relatively  fre¬ 
quent  disconnection,  such  as  for  installation,  removal, 
and  maintenance  purposes.  Multipin  plugs  and  receptacles 
and  the  various  forms  of  etched  card  connectors  fall  in 
the  group.  Multipin  plugs  and  receptacles  are  sufficiently 
familiar  as  to  need  no  further  description.  Female  con¬ 
nectors  for  etched  boards  (sockets)  vary  widely  in  detail 
design,  but  all  attempt  to  combine  some  form  of  high- 
pressure  wiping  contact  with  easy  insertion-withdrawal 
characteristics.  Male  etched  board  connectors  (plugs)  may 
be integral or attached. Integral, or "self," connectors
are tabs of the metallic laminate, and may be redundant
(duplicated on both sides) or non-redundant, requiring like
characteristics of the socket. Attached etched-board plugs
may be individual male contact parts, staked and soldered
or otherwise attached, or complete plug assemblies fastened
to the board.

A second category of interconnections is designed to
facilitate infrequent disconnection, such as for correction
of assembly errors or incorporation of changes, while
maintaining good long-term performance characteristics.
This group includes wire-wrap and taper-pin techniques.
Wire-wrap involves wrapping several turns of bare wire
around a terminal, preferably of rectangular cross-section,
with a special tool. The high pressure action at the
corners of the terminal produces a bond in which solid
state diffusion may actually take place, causing eventual
improvement of the connection [8,9]. Taper-pin connections
are made by first staking a tapered sleeve to the end of
the wire, then driving the sleeve into a mating female
sleeve with a special impact tool.

Soldering,  welding,  and  thermocompression  bonding 
comprise  a  class  of  techniques  still  permitting  discon¬ 
nection  and  reconnection,  but  at  less  convenience  than 
the  two  methods  above. 

Finally,  plating,  metallic  deposition,  solid-state 
diffusion,  and  similar  techniques  can  produce  intercon¬ 
nections  having  specific  desirable  characteristics,  but 
which  are  essentially  non-alterable. 

As  part  reliability  increases,  so  does  the  likelihood 
of  emergence  of  interconnection  unreliability  as  the 
dominant contributor to failure.

Figure V-1 shows the schematic and a possible etched
board and monolithic integrated circuit layout for a four-input
diode NAND circuit. The component count is six
diodes, three resistors, and one transistor. Table V-1
shows the interconnection count, assuming planar transistors
and diodes in individual cases, and a conventional
can closure for the microcircuit.

Table V-1

CONNECTIONS FOR ETCHED BOARD AND INTEGRATED CIRCUITS

Type                               Etched (Discrete)    Integrated

Die attach, bond, weld                    23                15
Diffused aluminum contact                  8                24
Solder joint                              21                 0
Diffused aluminum path                     0                10
Copper path                               11                 0
External (socket, solder, weld)            7                 7

Total, all types                          70                56

Fig. V-1 -- Etched board and monolithic integrated circuit layout

For discrete circuits, the parts reliability estimates
cover the first two items (31 instances). For integrated
circuits, all but the last item are similarly covered.
Failure rates may be estimated as .020 for the discrete
circuit (four equivalent transistors @ .005) and .010 for
the integrated circuit. If interconnection reliability,
to be neglected, must be an order of magnitude better than
that of the parts, then individual connection failure rates
must be no greater than (ignoring the copper paths):

.002/28 = .00007 for the discrete circuit;

.001/7 = .00014 for the integrated circuit.
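The two figures follow from Table V-1 by dividing one-tenth of the circuit failure rate among the connections not already covered by the parts estimates. The sketch below reproduces that arithmetic; the function name is hypothetical, and the connection counts are those of the table with copper paths ignored, as in the text.

```python
def allowable_connection_rate(circuit_failure_rate, uncovered_connections, margin=10):
    """Per-connection failure rate that keeps interconnections negligible.

    "Negligible" is taken, as in the text, to mean an order of magnitude
    (margin = 10) below the failure rate of the parts themselves.
    """
    return circuit_failure_rate / margin / uncovered_connections


# Connections not covered by the parts reliability estimates (Table V-1).
DISCRETE_UNCOVERED = 21 + 7    # solder joints plus external connections
INTEGRATED_UNCOVERED = 7       # external connections only

if __name__ == "__main__":
    print(allowable_connection_rate(0.020, DISCRETE_UNCOVERED))
    print(allowable_connection_rate(0.010, INTEGRATED_UNCOVERED))
```

The integrated circuit tolerates a per-connection rate twice that of the discrete circuit because, although its total allowance is halved, only a quarter as many connections must share it.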

With suitable care these figures are apparently
achievable for wire wrap, soldered, and welded interconnections,
as indicated in Appendix E. Taper pin connections
are probably somewhat less reliable, but are becoming
unpopular due to topological limitations and the
impossibility of automatic assembly. Connector data obtained
from Ref. 10 is probably misleading for modern
etched-board connectors. One source [7] indicates the
feasibility of etched-board connector failure rates completely
compatible with best semiconductor device rates,
using only non-redundant self-connectors. Although brute-force
life-test verification is not, and may never be,
available, one may reasonably assume, on the available
evidence, that deposited aluminum interconnections and
thermocompression bonds (again with the proper control)
can achieve reliabilities equal to or better than the more
gross methods. One technique under development uses all-diffused
interconnections, mounting silicon integrated
circuits on a silicon "carrier board." Potential reliability
is clearly as good as that of the processes forming
the active elements themselves, though reduction to
practice may present considerable difficulty.

The degradation or failure indication of an interconnection
is, of course, increased resistance, which in
turn results from increased resistivity or decreased
cross-section of the conductive path. Electrochemical
and thermochemical changes such as electrolysis, oxidation,
and corrosion may either increase resistivity or
decrease cross-section. Partial detachment through
physical stress decreases cross-section. The solder joint
requires more care than the others in that cleaning is
required to remove flux residues.

It should be noted that failure ascribed to interconnections
may result either from static or dynamic
effects--it may be caused by the interconnection design,
as well as interconnection degradation. The problems of
interconnection design for optimum dynamic behavior have
been covered in the literature [11], but, with respect
to noise considerations, each new system requires its own
analysis.

5.  EXTREME  ENVIRONMENTS 

Packaging reliability, for ground-based equipment in
a controlled atmosphere, need not be given detailed consideration
here, unless protection against physical shock
from nearby explosions is required. The parameters of
the shock may be estimated, and conventional shock-isolation
procedures incorporated. Protection against short-term
high-intensity radiation involves consideration of two
effects--permanent damage, and transient behavior [12-14].

With respect to permanent damage, the type, intensity,
and duration of radiation become a degradation/failure
stress factor to be incorporated in part selection, assignment
of end-of-life design parameter limits, and estimation
of failure rate. If the estimated failure rate or spread
of degradation limits for available part-types is intolerable,
some form of shielding will be required.

A  transient  error  or  fault  resulting  from  radiation 
is  no  different  from  that  caused  by  any  other  unanticipated 
stress;  all  previous  comments  on  this  subject  apply. 

6.  THE  ROLE  OF  THE  MANUFACTURER 

A final system consideration is essentially the summation
and reiteration of a number of allusions elsewhere
in this report--specifically, the integrity of the manufacturer.
An entire organization might possess an attitude
which operates essentially to the detriment of the
product. Conversely, organizations exist which evidence
a technical and administrative orientation which tends
to guarantee production of outstanding systems. No method
of computing mean-time-between-failure can take into
account intangibles such as

o Lack of firm technical policies in design, materials
selection, handling, assembly, and
checkout;

o Lack of understanding of, or agreement with,
said policies on the part of the technical team;

o Emphasis on quantity, rather than quality, in
engineering personnel--often the result of an
over-aggressive, expansion-oriented marketing
policy;

o Misunderstanding of the roles of inspection,
quality control, production control, reliability,
and PERT, and compounding of the felony by the
introduction of even more esoteric groups and
philosophies (often called "testing quality into
the product").
Chapter VI

THE PROCESS OF MAINTENANCE


1. INTRODUCTION

Whether downtime starts with a failure or the detection
of failure is a philosophic distinction similar to that
outlined at the start of Chap. II. Ideally, detection of
failure would always precede any erratic output action,
and approaches to this ideal have been discussed. This
chapter essentially concerns the period between detection
of failure and restoration of the system to on-line operational
capability--the "corrective maintenance period."
Corrective maintenance differs from preventive maintenance
in that it is unscheduled. It occurs whenever the fault
is detected rather than when it (maintenance) is scheduled.

Figure VI-1 is a flowchart of the field corrective
maintenance process. The time from (FAULT IS DETECTED) to
re-entry into the operating state is called the "time to
repair" (TTR), or, better though rare, the "time to restore."
The time from (FAULT OCCURS) to (FAULT IS DETECTED) might
be called the "time to detect," and will be assumed negligibly
small relative to any TTR or time between failures
(TBF).
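
The quantities TTR and TBF lead directly to an availability
estimate. As a hedged sketch, assuming the conventional
steady-state relation (availability equals mean TBF divided by
the sum of mean TBF and mean TTR) and purely hypothetical sample
values, the computation might look as follows. The function name
and the numbers are illustrative assumptions, not data from this
report.

```python
from statistics import mean


def steady_state_availability(tbf_hours, ttr_hours):
    """Fraction of time the system is up, from observed samples.

    Uses the conventional relation MTBF / (MTBF + MTTR); time to
    detect is neglected, as assumed in the text.
    """
    mtbf = mean(tbf_hours)
    mttr = mean(ttr_hours)
    return mtbf / (mtbf + mttr)


# Hypothetical observations, in hours.
tbf_samples = [400.0, 600.0, 500.0]
ttr_samples = [1.0, 3.0, 2.0]

availability = steady_state_availability(tbf_samples, ttr_samples)
print(f"availability = {availability:.4f}")
```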

The chart contains some implicit simplifying assumptions:
a) the diagnostic period always ends prior to
arrival of personnel, and b) human diagnosis always follows
the orderly pattern of system-to-group-to-module.

The various actions in the chart will be considered
sequentially, commencing with fault detection. Detection
of a permanent error by any means, as described in Sec. V-3,
is, of course, equivalent to detection of a fault. Special
built-in fault detection circuits are available which
permit more rapid detection of situations which, if unchecked,
may lead to multiple failures (such as overvolt-undervolt
relays on power supplies and temperature interlocks).

Fig. VI-1 -- Flow chart of field corrective maintenance process
(Expressions in parentheses are exit conditions)

Direct  programmed  fault  detection  is  possible  only 
if  some  built-in  provisions  are  made  to  convert  electrical 
signals  to  a  form  accessible  as  computer  information.  The 
usual  problem  exists  in  the  case  of  a  control  fault  which 
does  not  permit  the  program  to  run. 

Diagnostic (fault location) methods may also use a
combination of built-in and programmed implementation.
A built-in fault detection scheme often provides location
automatically.

2. DIAGNOSTIC TECHNIQUES

Success in detecting and locating faults at an operational
site depends on a number of factors:

o The time available to perform the necessary
functions;

o The built-in features of the system which aid
in the diagnosis;

o The level to which fault location is required;

o The quality of the personnel.

Totally automatic approaches that will detect and
locate any possible system fault down to the desired
replaceable/repairable level are not yet available.
However, a number of automatic approaches are in use that
will effectively handle a major sub-set of the possible
failures.

A conventionally designed computer requires a large
portion of the computer to be in working condition if the
computer is to be capable of executing a program. This
portion is usually 30 per cent or more of the computer
circuits.*

*The value 30 per cent applies to conventional designs.
As shown later, significant decreases in this fraction
(usually called the "hardcore") can be achieved.

Therefore, when developing a computer program that
will automatically diagnose failure (at least in the computer
main frame), one must remember that a large portion
of the machine must be in working order merely to run the
program correctly. Moreover, if a fault exists in the
necessary basic circuits, nothing more than a very gross
isolation of the failure is possible.

Another drawback of conventional computers is the
inaccessibility of the various registers. In general, the
only registers available for comparison purposes are the
accumulator and multiplier/quotient registers. The contents
of all other registers must be deduced. Of course, most
machine consoles have display capabilities for these other
registers, but the displays are not usually accessible to the
machine's comparative circuits. Furthermore, one cannot set the
internal state of a conventional computer from an external
location (e.g., the console). Generally, the ability to
set the internal state is limited to that which is required
for program running and monitoring, and the hardware is
assumed to be functioning properly. Hence, the state of
the machine after any operation is unknown unless the
machine is operating without errors.

Three  basic  approaches  to  diagnostic  programs  for 
computer  mainframes  exist.  Generally  speaking,  the  part 
of  the  program  devoted  to  fault  detection  is  very  similar 
in  all  three  cases.  The  methods  differ  primarily  in  the 
techniques  used  for  fault  location  and  the  extent  to  which 
this  is  achieved  automatically. 

Method  1 


The first approach is the one most commonly used in
both commercial and military installations. This type
of program is characterized by the following features and
assumptions:

o The set of circuits functioning properly is
sufficient to enable the program to operate;

o Only single faults will occur;

o Fault isolation can be deduced from a set of
fault indications for all possible, acceptable
failures (i.e., it rarely says what the indications
of the failure were);

o The technical level of the technicians and the
time available are adequate to deduce failure
locations;

o Few or no additional circuits are included in
the system to perform fault location;

o It does not attempt to exhaustively test the
logic; rather, it performs a "sufficient" number
of tests with "randomly" generated data and
specially designed sample patterns.

A description of a program of this type is given by
Bashow [1].

This approach is probably inadequate for field-operational
data processors, as it requires extremely
competent and well-trained technicians for effective
application.

Moreover, about 30 per cent of the computer is not
treated at all by this approach. Since the circuits in
that 30 per cent are not basically different from the
rest of the system, about 30 per cent of the faults cannot
be treated without the addition of extensive testing
equipment.

This  approach  is  also  characterized  by  the  lack  of 
knowledge  of  the  effectiveness  of  the  procedure.  The 
test  routines  are  heuristically  developed  and  no  methods 
exist  to  test  their  completeness.  This  would  lead  to 
continuous  debugging  and  updating  of  the  program  even  in 
the  absence  of  the  inevitable  system  modifications. 

The results claimed by Bashow [1] clearly underscore
the inadequacy of this approach for field applications.
Of 34 actual tests, only 23 were adequately diagnosed by
the program. No further statistics were given as to the
efficacy of the program.

Method  2 

The second approach, used to some extent, utilizes
a second computer to test the first computer. The following
features characterize it:

o Fault detection and isolation are totally automatic
(to whatever extent the program is capable of
such detection and isolation).

o A great deal of pre-processing is required to
accumulate the data necessary to realize detailed
fault isolation. This is accomplished both by
simulation and by actual introduction of failures.

o Some additional hardware is required in the computer
to effect the interconnection of the two
computers.

o Generally, only single faults are assumed to
exist, although multiple faults could be handled.

o Each fault is associated with a given, unique
sequence of failure indications that can be
monitored by the "good" computer.

o It attempts to exhaustively test the logic and
to, at least, detect all possible errors.

o It assumes that the technical level of the technicians
and/or the time available is low.
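
The dictionary idea in the list above may be sketched as follows.
This is a minimal illustration under stated assumptions: the
indication patterns and the package labels are all hypothetical,
and the sketch is not the procedure of any system cited here.
Each known single fault is stored under the pass/fail pattern it
produces; an observed pattern is then looked up to obtain the
suspect packages.

```python
from typing import Dict, List, Tuple

# One entry per test, in a fixed order: True = passed, False = failed.
Signature = Tuple[bool, ...]


def build_dictionary(fault_table: Dict[str, Signature]) -> Dict[Signature, List[str]]:
    """Invert a package -> signature table into signature -> packages."""
    dictionary: Dict[Signature, List[str]] = {}
    for package, signature in fault_table.items():
        dictionary.setdefault(signature, []).append(package)
    return dictionary


def locate(dictionary: Dict[Signature, List[str]], observed: Signature) -> List[str]:
    """Return the suspect packages for an observed signature.

    An empty list means the pattern is not in the pre-computed set.
    """
    return sorted(dictionary.get(observed, []))


# Hypothetical pre-processing result for three single faults.
fault_table = {
    "package_A": (False, True, True),
    "package_B": (True, False, True),
    "package_C": (True, False, True),  # same indications as package_B
}
dictionary = build_dictionary(fault_table)

print(locate(dictionary, (False, True, True)))    # located to one package
print(locate(dictionary, (True, False, True)))    # located to two packages
print(locate(dictionary, (False, False, False)))  # not in the dictionary
```

The sketch reports an unknown pattern as an empty result; a fault
outside the pre-computed set that happens to reproduce a stored
pattern would still be attributed to the stored packages.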

The multiple computer approach is more suited to
field diagnosis. However, it presents several problems.
First is the cost of generating the fault-location data
and of maintaining the data (i.e., keeping it current with
the state of the system).

Another problem is determining which faults to consider.
Once the set of acceptable faults is selected,
the entire structure of the maintenance procedure is fixed.
No other faults can be treated, and any other fault may
be improperly diagnosed as one belonging to the acceptable
set. This could lead to great difficulties in actual
maintenance because, although the average repair time might
be very low, the maximum repair time is likely to be extremely
high, perhaps a matter of days.
Tsiang and Ulrich [2] give the following statistics:
For a computer containing 8000 circuit packages having
6500 transistors and 45,500 diodes, the diagnostic program
required 7200 words, and 50,000 faults were treated.
Only single failures were considered.

For  75  per  cent  of  the  included  faults,  location  is 
to  one  circuit  package  and  for  13  per  cent,  location  is  to 
two  packages.  Probably  75  per  cent  of  the  failures  are 
locatable  with  this  method. 

The  fault  diagnostic  information  was  generated  by 
actually  creating  the  failure  in  the  hardware  rather  than 
by  simulation.  The  generated  data  consisted  of  about  60 
million  bits  and  was  reduced  to  a  dictionary  of  1290 
11x15  pages.  The  project  required  about  13  man-years  and 
250  hr  of  machine  time. 

Method  3 


This rare approach is characterized by the following
features:

o The computer diagnoses itself; however, additional
circuits are provided to set and reset all (or
most) storage devices "directly," to read the
contents of all (or most) storage devices "directly,"
and to provide alternate control paths;

o A fairly small percentage of the circuits (~10 per
cent) must be functioning properly to allow the
program to operate;

o Fault detection is totally automatic, but fault
location is only partially automatic;

o A great amount of pre-processing is required to
accumulate data for automatic fault location;

o The approach can handle all single faults and
some multiple faults;

o Faults are located by observing the point at
which the test failed (gross isolation) and the
resulting indications, and then deducing the
cause by referring to appropriate documents;

o Testing and, consequently, fault detection are
exhaustive;

o It assumes a fair level of technical competence
from the technicians and a fair amount of available
time.

This method has many features of merit. A description
of a system of this type is given by Carter [3].
The powerful control capability that can be exercised over
the computer is particularly attractive. It allows the
development of simple test procedures by sub-dividing the
system into manageable chunks. Most of the tests and
the data needed to analyze the results can be automatically
generated from the data stored in design automation files.
This leads to a minimum of transcription errors, and completeness
in the testing procedures. It also allows automatic
updating of diagnostics in response to system changes.

The method has drawbacks, however. The hard core
which cannot be reached by the program has been reduced to
ten per cent, but this ten per cent still must be handled
by other means. The enormous cost must also be considered
a drawback, if not an absolute deterrent. Over 100,000
instructions are required for the IBM System/360 and the
cost might conservatively be $500,000. Finally, fairly
sophisticated technicians will be needed to implement
this method.

Further consideration here of these methods is unprofitable.
The nature of the service, the allowable
budget, and the value of very low repair times all must
be estimated before recommendations can be made.

Much work has been done on error diagnosis and fault
location which cannot be reported here. The general theory
of diagnosis in switching and sequential circuits is discussed
in Refs. 4-8. The subject of transient and intermittent
errors and the ability to re-examine the program
has not been considered, but information may be found in
Ref. 3. Elegant methods such as those being developed
for IBM System/360 have a possible dividend in that logical
design errors are also sometimes revealed [3,4].

3.  THE  MAINTENANCE  MODULE 

The process of fault location is significantly affected
by the size of the maintenance module. If the
module is an entire computer, fault location becomes
identical with detection. This creates a rather bulky
and expensive module, however. At the other extreme is
the single part as a maintenance module. Fault location
to a single part at field level, and the problems associated
with installation, offset the portability and
economy of the single-part module.

As an attempt to evaluate the tradeoffs involved in
module size selection, consider a computer composed of

100,000 transistors of 25 types,
150,000 diodes of 25 types,
150,000 composition resistors of 50 types,
50,000 film resistors of 50 types,
50,000 mica capacitors of 50 types.

The total is 500,000 parts of 200 types. An additional
200 types might be required in other categories (paper
and electrolytic capacitors, zener diodes, transformers),
making 400 types altogether.

When parts are combined into plug-in modules, they
will be partitioned into various numbers of modules of
various types, which will determine the nature of a complete
set of spares. Although module partitioning statistics
are not readily available, the few instances
discovered led to formation of the following highly
hypothetical partitioning rules:

o Half of the modules of the system are replaceable
by 16 types;

o Half of the remaining modules are replaceable by
16 more types;

o And so on, until the remainder is less than 32,
when all remaining modules are assumed to be
unique.

For  the  500,000-part  computer,  some  partitioning 
options  are  listed  in  Table  VI-1,  assuming  an  average  part 
cost  of  $1.00.  For  the  one-part  module,  the  partitioning 
rule  gives  238  types,  but  the  assumed  number  of  part  types 
was  used  instead.  Note  that  "cost"  includes  parts  cost 
only.  The  more  complex  modules  would  be  more  expensive, 
due  to  assembly  labor  and  burden. 

Table VI-1

SPARE PARTS COST VS. NUMBER OF MODULES

                              Parts/Module    Parts Cost of
  Modules     Module Types    (= Cost in $)   Spare Set, $
  -------     ------------    -------------   -------------
        1            1           500,000         500,000
        5            5           100,000         500,000
       50           41            10,000         410,000
      500           79             1,000          79,000
    5,000          102               100          10,200
   50,000          185                10           1,850
  500,000          400                 1             400

The most significant factor to balance against cost
of a set of spare parts is time to diagnose to the maintenance
module level. Estimates of diagnostic time vs.
maintenance module size are at least as nebulous as the
partitioning rule, but one possibility is to hypothesize
that diagnostic times are related as the logarithm (base
10) of the module size ratio. This yields the estimate
shown in Table VI-2.
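
The logarithmic hypothesis can be written out as a short sketch.
The function names are assumptions introduced for illustration;
the rule itself (relative diagnostic time equals the base-10
logarithm of the ratio of total parts to parts per module) is the
one hypothesized above, applied to the 500,000-part computer.

```python
import math

TOTAL_PARTS = 500_000  # size of the example computer


def relative_diagnostic_time(parts_per_module, total_parts=TOTAL_PARTS):
    """Base-10 logarithm of the module size ratio."""
    return math.log10(total_parts / parts_per_module)


def percent_of_maximum(parts_per_module, total_parts=TOTAL_PARTS):
    """Diagnostic time as a percentage of the one-part-module case."""
    longest = relative_diagnostic_time(1, total_parts)
    return 100.0 * relative_diagnostic_time(parts_per_module, total_parts) / longest


for size in (1, 10, 100, 1_000, 10_000, 100_000, 500_000):
    print(f"{size:>7,d}  {relative_diagnostic_time(size):4.1f}  "
          f"{percent_of_maximum(size):5.1f}")
```

The printed columns can be compared against Table VI-2; small
differences in the last digit arise from rounding the logarithm
before or after taking the ratio.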

Figure VI-2 is a plot of the relative cost of a set
of spares, and relative time to diagnose. Also shown is
an equal-weighted sum, indicating that for any partitioning
rule, diagnosis time rule, or weighting of cost-of-spares
versus cost-of-diagnostic-time, there should be some
optimum region of maintenance module size.
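
A minimal sketch of the equal-weighted sum follows. It assumes,
as an interpretation not confirmed by the text, that each curve
is normalized to its own maximum before the two are added; the
module-type counts are those of Table VI-1, the part cost is the
assumed $1.00, and all names are illustrative.

```python
import math

TOTAL_PARTS = 500_000
PART_COST = 1.00  # dollars per part, as assumed for Table VI-1

# Parts per module -> number of module types (from Table VI-1).
MODULE_TYPES = {
    500_000: 1,
    100_000: 5,
    10_000: 41,
    1_000: 79,
    100: 102,
    10: 185,
    1: 400,
}


def spare_set_cost(parts_per_module):
    """Parts cost of one spare module of every type."""
    return MODULE_TYPES[parts_per_module] * parts_per_module * PART_COST


def relative_time(parts_per_module):
    """Logarithmic diagnostic time, normalized to the one-part case."""
    return math.log10(TOTAL_PARTS / parts_per_module) / math.log10(TOTAL_PARTS)


max_cost = max(spare_set_cost(p) for p in MODULE_TYPES)


def equal_weighted_sum(parts_per_module):
    """Normalized spares cost plus normalized diagnostic time."""
    return spare_set_cost(parts_per_module) / max_cost + relative_time(parts_per_module)


best = min(MODULE_TYPES, key=equal_weighted_sum)
for p in sorted(MODULE_TYPES):
    print(f"{p:>7,d}  {equal_weighted_sum(p):.3f}")
print("smallest sum at", best, "parts per module")
```

Under these particular assumptions the smallest sum occurs at
1,000 parts per module; a different weighting or partitioning
rule would move the optimum.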

A final factor in the corrective maintenance flow
is the nature and amount of test equipment for field
diagnosis, and the extent to which system, logic, and
circuit design facilitate this activity. In early computers,
this system aspect was entirely neglected on the
assumption that a clever field engineer with an oscilloscope
could, sooner or later, figure out what was wrong.
Fig. VI-2 -- Relative cost and diagnosis time versus parts per module

Table VI-2

ESTIMATED DIAGNOSTIC TIME VS. NUMBER OF MODULES

  Parts in     Relative           Per cent of
  Module       Diagnostic Time    Maximum
  --------     ---------------    -----------
        1            5.7             100
       10            4.7              82.5
      100            3.7              65.0
    1,000            2.7              47.4
   10,000            1.7              29.8
  100,000            0.7              12.3
  500,000            0.0               0.0

Currently, all enlightened digital systems manufacturers
admit the need for compromising system design relative
to operational specifications, to permit rapid, hazard-free
field diagnosis. Requirements extend from provision
for connection of test equipment to make "accidents" impossible,
to design of specialized analysis equipment
brought into the site when self-diagnosis fails. As this
is usually the case when control faults occur, system
provisions for forcing control states and sequences as
well as for setting register contents are usually included.
Many other features of various degrees of sophistication
may be provided, assuming some sort of cost/risk tradeoff
evaluation underlying each decision.
The topics of depot maintenance, spare parts logistics,
and personnel training are significant, but are beyond
the scope of this report.

4.  PREVENTIVE  MAINTENANCE 

In most practical systems, certain items of equipment
are necessary, yet have failure rates so high that
they completely dominate the system failure behavior.
Notable examples are magnetic tape transports and typewriters.
For systems requiring extreme reliability, three
solutions to this problem are available:

o  Eliminate  such  devices  from  the  on-line  portion 
of  the  system,  relying  on  them  only  for  peripheral 
functions  not  essential  to  the  primary  objectives; 

o  Provide  some  form  of  redundancy; 

o  Design  a  preventive  maintenance  plan. 

Preventive maintenance introduces scheduled periods
of downtime of controlled duration to decrease instances
of non-scheduled periods of downtime, with a net saving
in total downtime. Preventive maintenance periods initiate
certain standard procedures such as adjustments, cleaning,
and lubrication, as well as specially designed tests which
indicate the relative degradation of some parts. This
latter process, called marginal checking, permits replacement
of parts showing anomalous degradation before failure
actually occurs. If the marginal check causes a "weak"
part to fail, it is really a form of latter-day burn-in.
More often, the effect of a marginal check is to introduce
a transient fault which is then diagnosed in the usual
manner. Typical marginal checks for solid-state digital
systems include:

o Increase of clock rate;

o Decrease of a voltage to which the noise immunity
is monotonically related;

o Decrease of a voltage to which the logic signal
swing is monotonically related;

o Introduction of simulated transistor I_CBO;

o Simulated reduction of transistor h_FE.

The  effectiveness  of  any  such  measures  depends  on  the 
expected  degradation  behavior  and  the  response  of  the 
parts  to  the  simulated  enhancement.  If  occasional  voltage 
breakdown  were  a  failure  mode,  raising  the  supply  voltage 
as  a  marginal  check  would  be  a  disastrous  election,  due 
to  the  severely  nonlinear  behavior  near  breakdown. 

Clock rate increase is easily accomplished, either
continuously or in one or more increments. The second
and third items depend on the exact design--that is, the
availability of a single voltage supply exhibiting the
required relationship. In some systems, the fourth and
fifth items above are accomplished by adding a resistor
to the base of each transistor. The free ends of the
resistors may be connected together by a single marginal
check bus which, for one polarity of applied voltage,
injects artificial I_CBO, and for the other polarity "robs"
drive current, thus requiring higher h_FE for performance.
The penalties for this approach are a) added stray capacitance
at the base for high-speed circuits, and b) loss
of current and possible intercircuit coupling when in the
normal operating mode.
Another approach is to remove maintenance modules
from the computer and test them individually on a marginal
checker. This practice should be strongly discouraged,
as tampering in this way with an otherwise operative system
is probably more conducive to failure than normal part
behavior.

The  "cleaning,  adjustment,  and  lubrication"  type  of 
preventive  maintenance,  where  required,  must  obviously 
be  provided.  Marginal  checking  should  be  provided  only 
if  justified  by  overall  maintenance  logistics,  including 
the  effects  of  multiplicity  of  systems  or  subsystems  for 
redundancy. 

All the above approaches (except module removal) may
be carried out automatically, permitting fast return to
the operational mode, on demand.
VII. SUMMARY


1.  INTRODUCTION 

This section summarizes briefly but quantitatively the
contents of this report, while the Introduction (Chap. I)
presents the motivation for the work, gives the scope of
the report, and defines certain key terms. The reader who
requires only the conclusions and a few numbers and is
willing to accept them without proof should find this summary
sufficient for his needs. Topics of lesser importance, or
those which are too difficult to summarize succinctly, will
only be cited here, and a reference to their location in
this Memorandum will be given.

Tables and figures given here sometimes duplicate
material which appears in the text. Where this threatens
to be excessive, the reader is referred to the text for a
specific figure. Recommendations are not duplicated here;
again, the reader will be referred to a specific location
in the report, and in all cases this summary supplies sufficient
context for their understanding.

In this report, the authors attempt to estimate the
availability of a data processor by starting with the
smallest part and carefully investigating its failure
behavior.* They concurrently construct a mathematical
model of system availability which gives the desired results
for a wide variety of systems if the failure behavior
of the part, the service features, and the size of the
system are known. These two efforts merge and estimates
of current and predicted systems availability appear as a
function of the size of the processor and the type of
service provided.

*Definitions, e.g., "availability," are not repeated
here; they appear in Chap.

In addition to this central theme, developed in Chaps.
II through IV and Appendices A through E, many more qualitative
topics receive study in varying degrees of thoroughness.
Among the more important ones are circuit design,
logical design, programming, and maintenance (Chaps. III,
V, VI). The purpose here is to relate these topics to
machine availability and to recommend improvements.

The  authors  know  no  way  of  quantitatively  relating 
circuit  design,  logical  design,  programming,  or  maintenance 
to  availability.  What  is  known,  without  equivocation,  is 
that  lack  of  care  in  any  of  these  areas  causes  the  processor 
to  be  down  for  unwarranted  lengths  of  time--long  after  the 
entire  system  has  been  accepted  and  declared  operational. 

2.  PARTS  AND  A  DEFINITION  OF  FAILURE 

After  a  brief  description  of  which  parts  are  treated 
(transistors,  resistors,  capacitors,  integrated  circuits, 
but  not  mechanical  devices  of  any  kind,  except  connectors), 
a  discussion  of  failure  and  its  many  meanings  is  given. 
Paraphrasing  Chap.  II,  we  conclude  that  a  part  has  failed 
when,  under  some  combination  of  normally  applied  stresses, 
one  or  more  parameters  of  the  part  vary  in  such  a  way  that 
the  functional  group  containing  the  part  does  not  perform 
its role. The point which needs emphasis is that the concept
of failure has no meaning apart from a concept of
proper circuit function.

3.  PART FAILURE MECHANISMS AND DISTRIBUTIONS


Chapter II continues by discussing the manner in which
part parameters relate to stresses, and this leads to a
discussion of the two current methods of handling reliability
problems: a) statistical analysis, and b) the
physics-of-failure approach.

The origin, modes, and mechanisms of failure are investigated
and some conclusions are reached about the
existence  and  nature  of  failure  rates.  These  conclusions 
may  be  summarized  as  follows: 

o  Some  form  of  decreasing  failure  rate  for  the  total 
part  population  will  be  observed  which  is  in  no 
way  correlated  to  the  predicted  behavior  of  the 
ideal  part. 

o  The  total  part  population  shows  a  decreasing 
failure  rate  because,  and  only  because,  various 
controllably small subgroups show initially increasing
failure rates of various forms until
every  member  of  the  subgroup  has  failed,  at  which 
time  the  failure  rate  of  the  entire  remaining 
population  effectively  decreases. 

o  Eventually, a universal "wearout" mechanism (e.g.,
diffusion processes for solid state devices) will
cause the failure of all parts. This process
has an increasing failure rate, but systems normally
operate so far out on the "left tail" of this distribution
that it is not a factor.
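
The second conclusion above can be illustrated numerically. The following is a minimal sketch and is not taken from Chap. II: it assumes a hypothetical population in which 5 per cent of the parts form a weak subgroup with a Weibull (increasing) failure rate, while the remainder have a small constant rate. All parameter values and function names are invented for illustration.

```python
import math

def population_hazard(t, weak_fraction=0.05, weak_scale=100.0,
                      weak_shape=2.0, strong_rate=1e-5):
    """Failure rate (per hour) at age t of a hypothetical mixed population."""
    # Weak subgroup: Weibull with shape > 1, i.e. an increasing failure rate.
    weak_surv = math.exp(-((t / weak_scale) ** weak_shape))
    weak_pdf = ((weak_shape / weak_scale)
                * (t / weak_scale) ** (weak_shape - 1) * weak_surv)
    # Main population: constant (exponential) failure rate.
    strong_surv = math.exp(-strong_rate * t)
    strong_pdf = strong_rate * strong_surv
    # Population failure rate = mixture density / mixture survival.
    pdf = weak_fraction * weak_pdf + (1 - weak_fraction) * strong_pdf
    surv = weak_fraction * weak_surv + (1 - weak_fraction) * strong_surv
    return pdf / surv

for hours in (10, 50, 500):
    print(hours, population_hazard(hours))
```

Under these assumptions the population rate rises while the weak subgroup is failing and then falls back toward the constant rate of the surviving parts, which is the decreasing behavior described above.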

Arguments  are  presented  for  the  case  of  continuing  to 
use  the  exponential  failure  distribution  in  the  face  of 
this evidence, and Chap. II goes on to estimate the parameters
of the decreasing failure rate Weibull distribution
from the available evidence. Those estimates give marginal
confidence, to say the least.

4.  STANDBY CONDITIONS

It  is  further  concluded  that  there  are  no  compelling 
reasons  to  maintain  the  computer  (assuming  it  is  not  needed) 
in  a  quiescent,  power-off  condition.  Certainly  most  of  the 
parts  would  benefit  from  special  no-power  storage,  but  the 
lack  of  error  checking  and  the  danger  from  power  transients 
suggest  an  advantage  in  keeping  the  system  on.  Last  but 
not  least,  the  likelihood  of  permitting  a  large,  very 
expensive  digital  computer  to  remain  inoperative  in  order 
to  gain  a  little  reliability  is  quite  small. 

5.  THE "EQUIVALENT TRANSISTOR" COMPUTER AND
SOME TOPICAL SYSTEMS

Assuming that a computer consists primarily of transistors,
resistors, and capacitors, each with known reliability,
it is possible to treat the same machine (from
the  standpoint  of  reliability)  as  one  which  is  constructed 
entirely  of  transistors,  by  computing  how  many  resistors  (or 
capacitors)  it  takes  to  provide  a  failure  rate  just  equal 
to  one  transistor,  then  exchanging  parts  in  this  ratio. 

To  be  more  accurate,  more  than  one  category  of 
resistor  and  capacitor  should  be  employed,  since  the 
failure  rate  of  a  part  also  depends  on  its  use  in  the 
circuit.  This  complicates  matters  to  the  extent  that  the 
relation,  say,  between  resistor  failure  rate  and  the 
stresses  applied  to  the  resistor  in  its  circuit  must  be 
known  in  considerable  detail--a  nontrivial  task  in  most 
cases. When this is done, the size of a computer is
measured in "equivalent transistors." The complexity of
some existing systems using this measure is repeated in
Table VII-1.


Table VII-1

COMPLEXITY OF EXISTING SYSTEMS

System               Complexity in
                     Equivalent Transistors

FSQ-32               383 x 10^3
FSQ-31V              274 x 10^3
CDC-3600              97 x 10^3
CDC-1604A             82 x 10^3
Univac 1107           78 x 10^3
Burroughs B-5000      67 x 10^3
Honeywell H-1800      49 x 10^3
Honeywell D-825       41 x 10^3
SDS 9300              35 x 10^3
USQ-20                30 x 10^3
IBM 7090/44           26 x 10^3
GE 215/225/235        22 x 10^3

6.  THE UNIT PARTS COMPLEMENT

To  get  an  even  more  tractable  way  of  describing  the 
size  of  a  computer,  the  concept  of  a  unit  parts  complement 
is  introduced  in  Chap.  II.  This  is  taken  to  be  10,000 
transistors,  15,000  diodes,  15,000  composition  resistors, 
5000  film  resistors,  and  5000  mica  capacitors.  These 
reduce  to  18,334  equivalent  transistors.  The  size  of  a 
system  is  subsequently  expressed  in  multiples  of  this  unit 
parts  complement. 
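
The reduction to 18,334 equivalent transistors can be checked with a short calculation. This is a sketch under a stated assumption: the text does not say which failure rates were used for the exchange, and the "Best 1965" column of Table VII-2 is assumed here because it reproduces the quoted figure. Film and composition resistors are treated alike, as in that table. The names in the code are illustrative only.

```python
import math

# Failure rates from the "Best 1965" column of Table VII-2 (assumed basis).
RATES = {"transistor": 0.003, "diode": 0.0015,
         "resistor": 0.0001, "capacitor": 0.0001}

UNIT_PARTS_COMPLEMENT = {
    "transistor": 10_000,
    "diode": 15_000,
    "resistor": 15_000 + 5_000,   # composition + film, same tabulated rate
    "capacitor": 5_000,           # mica
}

def equivalent_transistors(parts, rates=RATES):
    """Number of transistors having the same total failure rate as `parts`."""
    total_rate = sum(count * rates[kind] for kind, count in parts.items())
    return total_rate / rates["transistor"]

unit = equivalent_transistors(UNIT_PARTS_COMPLEMENT)
print(math.ceil(unit))            # unit parts complement, rounded up
print(383e3 / unit)               # FSQ-32 expressed in unit parts complements
```

On this basis the FSQ-32 of Table VII-1 corresponds to roughly 21 unit parts complements.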

7.  PREDICTED  PART  FAILURE  RATES 

Part  cost  versus  reliability  and  procurement  policy 
is  investigated  next,  for  integrated  circuits  as  well  as 
the  discrete  type.  When  all  is  said  and  done,  the  most 
significant  results  are  shown  again  in  Table  VII-2.  When 
utilizing  the  availability  graphs  presented  below,  the 
reader  should  select  part  failure  rates  from  this  table. 

8.  INTEGRATED  CIRCUITS 

Further  discussion  of  the  reliability  of  integrated 
circuits  is  given,  and  Chap.  II  ends  with  conclusions  and 
recommendations  on  part  procurement  and  handling  which 
really  cannot  be  adequately  summarized  here. 

9.  CIRCUIT  DESIGN 

Chapter  III  introduces  the  subject  of  circuit  design 
and describes the philosophy of "bogey," "worst case," and
"statistical" design procedures. The latter two methods
will insure the satisfactory operation of the circuit
even though the parameters of some parts vary over wide
ranges.

Recommendations  for  good  circuit  design  are  given 
next;  these  are  independent  of  earlier  parts  of  the  chapter 
and can be read by themselves. Following them is a discussion
of logical design which contains some suggestions
on  how  to  minimize  errors  in  the  design  process. 


Table VII-2

PREDICTED PART FAILURE RATES
(%/1000 hr, 10-year average)

Part                               Good 1965   Best 1965   Best 1968

Resistor (composition, metal
  film, tin oxide)                 .0003       .0001       .00003
Capacitor (glass, mica)            .0003       .0001       .00003
Diode (silicon planar)             .005        .0015       .0005
Transistor (silicon planar)        .01         .003        .001
Integrated circuit
  (silicon planar)
    10 equivalent parts            .02         .005        .0015
    30 equivalent parts            .04         .009        .0025
    100 equivalent parts           .07         .015        .0040

10.  PART REDUNDANCY

Chapter III proceeds to the concept of part redundancy
to counter outright part failure. The question of whether
to adopt the redundant circuit technique as a means to system
reliability does not appear until Chap. IV, but Chap. III
compares non-redundant with various types of redundant computers
for the asymptotic case (i.e., for very large times
after first turning the system on) in Fig. VII-1, and for the
transient phase in Fig. VII-2. It is later shown that
if service is available, circuit (or part) redundancy is
not the best way to obtain high system availability. But
if no service is available, redundancy is best.

11.  FAILURE  DETECTION 

Finally, Chap. III takes up the problem of failure
detection. In this chapter, failure detection and correction
are considered from a circuit standpoint; later
this same topic is treated from a programming viewpoint.
Generally, it is concluded, error detection can and should
be performed wherever possible, and the expense is not
exorbitant. But error correction, while often feasible,
probably does not warrant the expense, particularly in light
of the elegant fault-diagnosis techniques which are currently
under development.

12.  MULTIPLE COMPUTERS AND THE MULTI-PROCESSOR

After various methods of using more than one
computer to achieve higher availability (duplex, triplex,
and multi-processor) are defined, these systems--the redundant and
single non-redundant computers--are compared in Chap. IV.
Figures VII-3 to VII-6 show the results. The list of
symbols should make them self-explanatory. In almost all
cases, the multi-processor is better than the duplex
arrangement, and the duplex system is superior to the redundant
system.+ This is true only for the asymptotic
case, which is the proper region for our attention when
service can be provided. The transient behavior of
multiple computers is not calculated for most cases, because
this implies a well-defined, relatively short, system
lifetime with no service (which just is not the case with
most ground-based systems). Unquestionably, non-redundant
methods will not measure up to redundant ones during this
transient phase.

Fig. VII-1--Asymptotic availability of redundant computer
(exponential service)

Furthermore, the problem of the increased size and
complexity of the program must be carefully investigated
before making a decision in favor of the multi-processor.
This decision is so problem-dependent, and the entire subject
so new, that useful guide lines are at this time
impossible.

+ Recommending the multi-processor requires restraint,
because the excellence of this system appears only when it
solves problems which allow simultaneous processing of
different parts of the problem. Problems which, although
very large, must be done in a strictly sequential manner
are not candidates for the multi-processor, and this system
is no longer first choice.

13.  PROGRAMMING

Chapter V points out that methods cannot be presented
that insure perfect programs. After all, correct codes are
not like "correct" circuit designs; they are either right or
wrong and not, in theory, a matter of judgment. But the
problem still remains as to how to tell if they are right.
The nature of the problem is outlined and a modus operandi
that should be helpful in obtaining error-free programs is
presented in Chap. V. It must be read in its uncondensed
form. Like so many other good and seemingly reasonable
sets of rules, these do not become difficult until they
are applied; whether a group of programmers (and their
supervisors) could be made to follow the rules is speculative.

It  is  further  concluded  that  very  little  can  be  done 
in  terms  of  programmed  error  detection.  "Programmed 
error detection" means the execution of a set of instructions
whose purpose it is to verify the correctness of the
computer logic or another set of instructions. Test problems,
whose solutions are known, approximate an automatic
check  of  the  logic,  but  no  program  which  checks  another 
program  yet  exists.  Simulation  is  a  possibility  and  there 
is  still  room  for  further  research,  but  all  in  all  this  is 
not  a  very  optimistic  area. 

14.  CONNECTORS  AND  PACKAGING 

Connectors and packaging do not constitute a threat
to availability if the best of today's techniques are
employed. How to treat extreme environment (e.g., excessive
radiation) is just barely touched upon near the end
of Chap. V.

15.  MANUFACTURER

Finally,  Chap.  V  considers  the  role  of  the  manufacturer. 
For a subject of such importance, and one which is so non-quantifiable,
very little has been said here. The subject
of  the  manufacturer's  attitude  and  what  helps  or  hinders  the 
procurement  of  high-quality,  reliable  equipment  needs  much 
more  attention. 

16.  AUTOMATIC  FAULT  DIAGNOSIS  AND  ISOLATION 

Chapter VI outlines rather completely the current
approaches to automatic fault diagnosis and isolation.
It shows that the success of any particular scheme depends
heavily on pre-planned hardware additions, the quality of
the service people, and the amount of money one wishes to
spend.

Automatic  fault  diagnosis  and  isolation  programs  can 
be  written  (e.g.,  IBM  System/360).  They  are  effective 
(most  single  errors  locatable  to  a  single  module),  and 
reasonably  complete  (the  single  error  must  be  excluded 
from  only  ten  per  cent  of  the  machine). 

17.  OPTIMUM  MODULE  SIZE 

Some analysis of the optimal size of a replaceable
module concludes Chap. VI. This is considered by first
estimating the relative time to diagnose the fault as a
function of the module size, then estimating the relative
cost of a set of spares as a function of module size. The
two functions are summed with equal weights, and the resulting
function is observed to have an absolute minimum--
which implies the probability of an optimum size.
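
The procedure just described can be sketched as follows. The two functional forms below are placeholders chosen only so that one term falls and the other rises with module size; they are not the estimates of Chap. VI, and the constants, units, and function names are hypothetical.

```python
def relative_diagnosis_time(size, k=400.0):
    """Hypothetical: larger modules are quicker to isolate a fault to."""
    return k / size

def relative_spares_cost(size, c=0.01):
    """Hypothetical: a set of spares costs more as modules grow."""
    return c * size

def best_module_size(sizes):
    """Module size minimizing the equally weighted sum of the two terms."""
    return min(sizes, key=lambda s: relative_diagnosis_time(s)
               + relative_spares_cost(s))

# Candidate sizes in parts per module (illustrative grid).
print(best_module_size(range(10, 2001, 10)))
```

With forms of this kind the sum has a single interior minimum; different assumed curves would of course move it.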
Appendix A

INTRODUCTION TO THE MATHEMATICS OF AVAILABILITY

1.  INTRODUCTION

This appendix presents a brief derivation, from
first principles, of the availability results cited in
Chaps. III and IV. Nothing done here is either new or
novel. Appendix A is included in this report not only for
the reader's convenience (to aid in attacking problems not
explicitly discussed in the main text), but also to support
the conclusions presented above. Knowledge of elementary
probability theory is assumed, and the presentation will
be as brief as possible. For greater detail (and considerably
more elegance) the reader is referred to Cox [1].

Before proceeding, some comments on transient and
asymptotic (steady-state) solutions are in order. Most
of the pertinent results concern a probability function,
P(t), usually called the "availability"--it is the probability
that a system is available for use at time t. A
"system" is anything from a single transistor to an entire
computing complex, depending on the context. In some
instances, this function is very difficult to evaluate
(for either analytic or numerical reasons which will be
discussed presently), and the asymptotic value,
$P_\infty = \lim_{t\to\infty} P(t)$, will be computed instead.

Care must be taken that $P_\infty$ in fact answers the proper
questions about availability, because the system might have
a transient period that is significant relative to the
equipment lifetime; i.e., at the time we are concerned about
the value of P(t), it may not, in truth, be close enough to
$P_\infty$. It is argued here that for systems whose reliability
must be very high and where service is possible, the use of
$P_\infty$ is perfectly acceptable. The following reasons support
this view:

o  For all cases of interest, $P_\infty < P(t)$, hence $P_\infty$ is
at least a lower bound on the availability;

o  Most electronic design procedures such as "worst
case" or "almost worst case" introduce more "overdesign"
into the system than would using $P_\infty$ as a
measure of availability instead of the true P(t).

In addition to these reasons which only say, in effect,
that no harm is done by using $P_\infty$, there are some more
definite results.

o  For non-redundant systems, P(t) is of such a form
that very small transient periods imply large $P_\infty$,
and conversely. Therefore, if a highly reliable
system is desired, the system will have a very short
transient phase, and the error in neglecting the
transient contribution to P(t) is negligible.

o  If service is possible, then the redundant computer
(even with service) is not better (i.e., does not
have larger P(t)) than the multiple machine configuration
except for very small times. Redundant
techniques are always accompanied by long transient
periods. Redundant computers are considerably more
complex, for fixed P(t), than multiple computers,
and their use in ground-based, serviceable systems
is doubtful. Of course, without service, the redundant
machine is the proper candidate, since the
transient term accounts for most of the reliability--
thus their favored use in space applications.

In the work that follows, the service time+ will always
be assumed to be exponentially distributed. When discussing
the repair of a computer, this assumption seems entirely
reasonable--the only other likely candidate being a service
time of fixed duration. These two cases are the same for
the asymptotic situation; i.e., if the mean of the exponential
service distribution is equal to some assumed constant
service time, then the probability of the machine being on,
i.e., working properly, for large t is the same regardless
of which distribution is assumed. This is not true for the
transient phase, but the analysis is complicated and will
not be carried out here. Readers desiring more information
on the constant service time case are referred to Saaty [2].

+ "Service time" is a random variable whose sample value
is the length of time the computer is inoperative following
a failure.

2.  DEFINITIONS  AND  THE  POISSON  PROCESS 

First,  some  definitions  and  notation  are  necessary. 

Let $T_f$ be a random variable denoting the time of failure
(exactly what has failed, a single part or an entire system,
will be clear from the context). Assume $T_f$ has a distribution
function, $F(t) = \Pr[T_f \le t]$. The "failure rate,"
r(t), is defined by the conditional probability+

$$ r(t) = \lim_{\Delta t \to 0} \frac{\Pr[t \le T_f \le t+\Delta t \mid T_f \ge t]}{\Delta t} \tag{1} $$

If $T_f$ also has a probability density function, f(t), then

$$ F(t) = \int_0^t f(x)\,dx \tag{2} $$

$$ \bar F(t) = 1 - F(t) = \int_t^\infty f(x)\,dx \tag{3} $$

and

$$ r(t) = \lim_{\Delta t \to 0} \frac{\Pr[t \le T_f \le t+\Delta t,\ T_f \ge t]}{\Delta t\,\Pr[T_f \ge t]}
        = \lim_{\Delta t \to 0} \frac{\Pr[t \le T_f \le t+\Delta t]}{\Delta t\,\Pr[T_f \ge t]}
        = \frac{f(t)}{1 - F(t)} = \frac{f(t)}{\bar F(t)} \tag{4} $$

By integrating (4) we get

$$ \int_0^t r(x)\,dx = \int_0^t \frac{dF(x)}{1 - F(x)} = -\ln[1 - F(t)]. \tag{5} $$

Hence

$$ F(t) = 1 - \exp\left[-\int_0^t r(x)\,dx\right] \tag{6} $$

and

$$ f(t) = r(t)\exp\left[-\int_0^t r(x)\,dx\right] \tag{7} $$

+ Another definition of r(t) is that it gives the
probability of "immediate" failure of an item, given that
the item has age t.
Failure processes whose failure rate, r(t), is a
constant are used frequently enough to make it worthwhile
to derive these processes from first principles. We first
require the notion of a Poisson process, which has the very
attractive property that the future behavior of the process
is independent of the past. That is, the probability of
an event occurring in an interval of length $\Delta$ depends only
on the length of the interval and not on when $\Delta$ occurs in
the time history of the process. A sufficient specification
for a stochastic process, f(t), to be a Poisson process
is given by the following postulate [3,4]:

The stochastic process f(t) is a Poisson
process if, for $\Delta$ sufficiently small,
there exists a positive constant $\lambda$ such
that the probability of one event occurring
in $(t, t+\Delta)$ is $\lambda\Delta + o(\Delta)$ and the
probability of more than one event occurring
is $o(\Delta)$.+

The exponential failure distribution (constant failure
rate) may be derived from this postulate. Consider a
single part whose failure behavior obeys the postulate and
is instantaneously replaced with an identical part if a
failure does occur. Define a random variable $N_t$ which
denotes the number of failures in (0,t) and let $P_m(t) =
\Pr[N_t = m]$. Use of the postulate gives the following
difference equation:

$$ P_m(t+\Delta) = P_m(t)(1 - \lambda\Delta) + P_{m-1}(t)\,\lambda\Delta. \tag{8} $$

Rearranging and passing to the limit gives the differential
equation++

$$ \lim_{\Delta \to 0} \frac{P_m(t+\Delta) - P_m(t)}{\Delta} = P'_m(t) = \lambda\,[P_{m-1}(t) - P_m(t)]. \tag{9} $$

For m = 0, Eq. (9) reduces to

$$ P'_0(t) + \lambda P_0(t) = 0. \tag{10} $$

Solving (10) with the initial condition $P_0(0) = 1$, and
then solving (9) repeatedly, we find that $P_m(t)$ is given
by the Poisson distribution

$$ P_m(t) = \Pr[N_t = m] = \frac{(\lambda t)^m}{m!}\,e^{-\lambda t}. \tag{11} $$

Using the definition given in Eq. (3) and assuming a
Poisson process therefore gives

$$ \bar F(t) = \Pr[T_f > t] = P_0(t) = e^{-\lambda t}. \tag{12} $$

Then F(t) is the exponential distribution function

$$ F(t) = \Pr[T_f \le t] = 1 - e^{-\lambda t}. \tag{13} $$

Thus

$$ f(t) = \lambda e^{-\lambda t} \tag{14} $$

and from Eq. (4)

$$ r(t) = f(t)/\bar F(t) = \lambda. \tag{15} $$

Also, the expectation of $T_f$ is $1/\lambda$, which is commonly
called the mean time to failure (MTTF).

+ o(x) denotes a function that has the property
$\lim_{x \to 0} o(x)/x = 0$.

++ Use of the prime, e.g., P'(t), will always denote
differentiation with respect to time.
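
The results of this section lend themselves to a quick numerical check. The sketch below evaluates the Poisson distribution of Eq. (11), the survival probability of Eq. (12), and the failure rate of Eq. (4) by finite differences; the function names and the rate of 0.01 failures per hour are illustrative assumptions, not values from the text.

```python
import math

def poisson_pm(m, lam, t):
    """Eq. (11): probability of exactly m failures in (0, t)."""
    return (lam * t) ** m * math.exp(-lam * t) / math.factorial(m)

def survival(lam, t):
    """Eq. (12): probability of no failure by time t, i.e. P_0(t)."""
    return poisson_pm(0, lam, t)

def failure_rate(lam, t, dt=1e-6):
    """Eq. (4): f(t) / Fbar(t), with f(t) taken as -d(Fbar)/dt numerically."""
    density = (survival(lam, t - dt) - survival(lam, t + dt)) / (2 * dt)
    return density / survival(lam, t)

lam = 0.01                      # hypothetical failure rate, per hour
print(sum(poisson_pm(m, lam, 50.0) for m in range(30)))   # close to 1
print(failure_rate(lam, 50.0), failure_rate(lam, 500.0))  # both close to lam
print(1 / lam)                  # MTTF in hours
```

The computed failure rate does not depend on the age t, which is the memoryless property used throughout the rest of this appendix.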
3.  NO REDUNDANCY--NO SERVICE AND EXPONENTIAL FAILURE
DISTRIBUTION

As a rough approximation of reality, assume that a
computer consists of N identical parts. Further, assume
that the parts fail independently of one another, with
failure times which are exponentially distributed
with failure rate $\lambda$.

Then the probability that a part survives past time
t is given by

$$ \bar F_p(t) = e^{-\lambda t} \tag{16} $$

and the probability that the entire computer survives past
t, $\bar F_c(t)$, (given that it was on at t=0), is equal to the
probability that all N parts have survived past t,+

$$ \bar F_c(t) = e^{-N\lambda t}. \tag{17} $$

For completeness more than any other reason, $\bar F_c(t)$ is
shown in Fig. A-1 for some interesting values of $N\lambda$.

+ This assumption, that the failure of any part causes
the failure of the machine, is the standard approach.
However, in practice it might not be the case, since
it is dependent on the nature of the program.

For a slightly more accurate model, assume that a
computer consists of $N_1$ transistors, $N_2$ diodes, and $N_3$
resistors and capacitors, where $N_1+N_2+N_3 = N$. If a transistor
has failure rate $\lambda_1$, a diode $\lambda_2$, and resistors and
capacitors $\lambda_3$, then the probability that the computer is
operating after time t is
$\bar F_c(t) = \exp[-(N_1\lambda_1 + N_2\lambda_2 + N_3\lambda_3)\,t]$.
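
A short sketch of these two survival formulas follows. It assumes, for illustration only, one unit parts complement with the "Best 1965" rates of Table VII-2, and it assumes those rates are quoted in per cent per 1000 hours, so that they are converted to failures per hour by the factor 1e-5. The function names are hypothetical.

```python
import math

PERCENT_PER_1000_HR = 1e-5   # assumed unit conversion to failures per hour

def survival_identical(n_parts, lam, t):
    """Eq. (17): all N identical parts survive past t."""
    return math.exp(-n_parts * lam * t)

def survival_mixed(counts, rates, t):
    """Mixed-part form: exp[-(N1*l1 + N2*l2 + N3*l3) * t]."""
    total_rate = sum(n * lam for n, lam in zip(counts, rates))
    return math.exp(-total_rate * t)

# Transistors, diodes, resistors plus capacitors (one unit parts complement).
counts = (10_000, 15_000, 25_000)
rates = tuple(r * PERCENT_PER_1000_HR for r in (0.003, 0.0015, 0.0001))

print(survival_mixed(counts, rates, 1000.0))   # roughly 0.58 after 1000 hr
```

Under these assumptions the total failure rate is 5.5e-4 per hour, so an unserviced machine of this size has a little better than an even chance of running 1000 hours without failure.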

4.  2m+1-FOLD REDUNDANCY WITH PERFECT VOTING--NO SERVICE
AND EXPONENTIAL FAILURE DISTRIBUTION

The purpose of this and the next few sections is to
give a general probabilistic description of machines which
employ redundancy to achieve reliability. Knox-Seith [5]
and Wilcox and Mann [6] give a much more detailed analysis
of the problem.

First consider a collection of M subsystems, $1 \le M \le N$,
which might constitute an N-part computer (N/M is an integer).
The size of M defines the level at which the redundancy will
be employed. For statistical purposes, all M subsystems
will be assumed identical. For instance, if M = N, then
each individual part is a "subsystem"; but if M = 1, the
entire computer is the subsystem. The intent is to reproduce
each of the M subsystems 2m+1 times in order to gain
the benefits of a redundant system. This is illustrated in
Fig. A-2.

Fig. A-2--Redundant computer

Each subsystem contains N/M parts and all parts are
independent and have identical exponential failure distributions.
The probability of survival of any non-redundant
subsystem is

$$ \bar F_s(t) = \exp(-N\lambda t/M). \tag{18} $$

Now with 2m+1-fold redundancy and failure-free majority
voting, a subsystem survives if at least m+1 of the circuits
making up the redundant group survive. The probability
of this event is given by

$$ P_s(t) = \Pr[k \ge m+1] = \sum_{k=m+1}^{2m+1} b[k;\,2m+1,\,\bar F_s(t)]
          = 1 - \sum_{k=0}^{m} b[k;\,2m+1,\,\bar F_s(t)] \tag{19} $$

where k is the number of circuits which survive past t and
$b(k;n,p) = n!\,p^k(1-p)^{n-k}/[k!\,(n-k)!]$. The failure of any redundant
group can cause the failure of the computer. Hence
the probability that the computer survives until t, assuming
independent subsystems, is

$$ P_c(t) = [P_s(t)]^M. \tag{20} $$


$P_c(t)$ is shown graphically in Figs. A-3 to A-10 for
m=1,2 and some typical values of $N\lambda$ and M. If the assumption
of perfect voting is used, inspection of Eq. (20)
shows that $P_c(t)$ is maximized (for fixed N and $\lambda$) by choosing
M as large as possible (N/M as small as possible); i.e.,
the voting should be done at the lowest possible level.
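
Equations (18) through (20) are easily evaluated. The following is a minimal sketch with illustrative function names; the value $N\lambda t = 1$ is an arbitrary choice and `math.comb` requires Python 3.8 or later.

```python
import math

def binom_pmf(k, n, p):
    """b(k; n, p) as defined after Eq. (19)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def group_survival(m, p):
    """Eq. (19): at least m+1 of the 2m+1 copies survive (perfect voter)."""
    n = 2 * m + 1
    return sum(binom_pmf(k, n, p) for k in range(m + 1, n + 1))

def computer_survival(m, M, n_lambda_t):
    """Eq. (20), with the subsystem survival of Eq. (18)."""
    p = math.exp(-n_lambda_t / M)
    return group_survival(m, p) ** M

for M in (1, 10, 100):
    print(M, computer_survival(1, M, 1.0))   # triplication, N*lambda*t = 1
```

In this example, triplicating the whole machine (M = 1) is worse than no redundancy at all, because each copy then survives with probability below one half, while finer subdivision (larger M, fewer parts per voted group) raises the survival probability steadily. Setting m = 0 recovers the non-redundant result of Eq. (17).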

5.  N-FOLD REDUNDANCY WITH IMPERFECT VOTING--NO SERVICE
AND EXPONENTIAL FAILURE DISTRIBUTION

If the voters themselves are also susceptible to
failure, then redundant voters should be used to increase
the voter reliability. This results in the system shown
in Fig. A-11. Again, each non-redundant subsystem has
N/M parts and each subsystem has independent and identical
survival probability, $P_s(t) = \exp(-N\lambda t/M)$. When using
redundant voters, it is not necessarily optimum to vote
each redundant subgroup. Instead, L groups are connected
in series prior to each voting operation. The probability
that a series chain of L non-redundant subsystems survives
until t is $P_L(t) = [P_s(t)]^L = \exp(-LN\lambda t/M)$. Now with 2m+1
redundancy, the output of which will be majority-voted, the
probability, $P_r(t)$, that there are at least m+1 surviving L-chains is

$$ P_r(t) = \sum_{k=m+1}^{2m+1} b[k;\,2m+1,\,P_L(t)]. \tag{21} $$

Fig. A-3--Availability of redundant computer (no service and exponential failure)

Fig. A-5--Availability of redundant computer (no service and exponential failure)

Next, assume that every voter is independent and has
survival probability $\bar F_v(t) = \exp(-\xi t)$. At least m+1 must
be operating for every 2m+1 redundant chain. The survival
probability, $P_{rv}(t)$, for the redundant L-chain plus voter is

$$ P_{rv}(t) = \sum_{k=m+1}^{2m+1}\ \sum_{j=m+1}^{2m+1} b[k;\,2m+1,\,P_L(t)]\; b[j;\,2m+1,\,\bar F_v(t)]. \tag{22} $$

Finally, the probability that the entire computer survives
is

$$ P_c(t) = [P_{rv}(t)]^{M/L}. \tag{23} $$

No numerical examples of Eq. (23) are presented here,
since only perfect voting is considered in the text. It
is interesting to note before leaving this topic that with
the added parameter, $\xi$, contributed by the voter survival
probability, there is a truly optimum value of L > 1. It
is no longer true that voting should be done at the lowest
level, but L is a function of $\lambda$ and $\xi$.
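
The existence of an optimum chain length can be seen in a small numerical search. This sketch evaluates Eqs. (21) through (23) for triplicated chains and voters (m = 1); it uses the fact that the double sum of Eq. (22) factors into the product of two majority sums. The parameter values ($N\lambda t = 0.1$, $\xi t = 0.005$, M = 100) and the function names are assumptions made for illustration.

```python
import math

def majority(m, p):
    """Probability that at least m+1 of 2m+1 independent units survive."""
    n = 2 * m + 1
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(m + 1, n + 1))

def computer_survival(m, M, L, n_lambda_t, xi_t):
    """Eq. (23): [P_rv(t)]^(M/L), with P_rv from Eqs. (21) and (22)."""
    p_chain = math.exp(-L * n_lambda_t / M)    # P_L(t), chain of L subsystems
    p_voter = math.exp(-xi_t)                  # voter survival probability
    p_rv = majority(m, p_chain) * majority(m, p_voter)
    return p_rv ** (M / L)

def best_chain_length(m, M, n_lambda_t, xi_t):
    """Search the divisors of M for the L that maximizes Eq. (23)."""
    candidates = [L for L in range(1, M + 1) if M % L == 0]
    return max(candidates,
               key=lambda L: computer_survival(m, M, L, n_lambda_t, xi_t))

print(best_chain_length(1, 100, 0.1, 0.005))   # voters less reliable than subsystems
print(best_chain_length(1, 100, 0.1, 0.0))     # perfect voters
```

With voters that are somewhat less reliable than a single subsystem, the search selects a chain of several subsystems per vote; with perfect voters it returns L = 1, in agreement with the preceding section.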

6.  NO REDUNDANCY--EXPONENTIAL SERVICE AND FAILURE
DISTRIBUTION

By arguments given previously, we have shown that $T_f$,
the time to first failure of an entire non-redundant N-component
computer, is exponentially distributed with
failure rate $r(t) = N\lambda$, $f(t) = N\lambda\exp(-N\lambda t)$, and
$\bar F(t) = \exp(-N\lambda t)$ (if, of course, part failures are exponentially
distributed). Now define a new random variable, $T_s$, called
the service time (the time to repair the computer), and
take this also to be exponentially distributed with service
rate $\mu$. The pertinent distributions for the service time are

$$ G(t) = \Pr[T_s \le t] = 1 - e^{-\mu t}, \tag{24} $$

$$ \bar G(t) = \Pr[T_s > t] = e^{-\mu t}, \tag{25} $$

$$ g(t) = G'(t) = \mu e^{-\mu t}. \tag{26} $$

The question now is what is the probability, P(t), that
the computer is on at time t?

-174- 


Remembering that both the failure and repair distributions
are derived from Poisson processes, and hence
have no memory, we may write a difference equation for $P(t)$.
Namely, for $\Delta$ sufficiently small,

$$P(t+\Delta) = P(t)\,\bar{F}(\Delta) + [1 - P(t)]\,G(\Delta), \tag{27}$$

where $\bar{F}(\Delta) = \exp(-N\lambda\Delta)$ and $G(\Delta) = 1 - \exp(-\mu\Delta)$. In other
words, the probability that the computer is on at $t+\Delta$ is
equal to the probability that the computer is on at $t$ and
remains on for a small additional time $\Delta$, or that the
computer was off at time $t$ and repair was completed in an
additional time $\Delta$. The value of $\Delta$ is chosen to be so small
that the probability of more than one repair or failure
is negligible. Expanding $\bar{F}(\Delta)$ and $G(\Delta)$ in a Taylor
series and retaining only the first-order terms transforms
Eq. (27) to

$$P(t+\Delta) = P(t)(1 - N\lambda\Delta) + [1 - P(t)]\,\mu\Delta. \tag{28}$$

Rearranging and taking the limit gives

$$\lim_{\Delta \to 0} \frac{P(t+\Delta) - P(t)}{\Delta} = P'(t) = -(N\lambda + \mu)P(t) + \mu. \tag{29}$$

With the choice of initial condition, $P(0)$, Eq. (29) is
easily solved for $P(t)$. The most interesting initial condition
is $P(0) = 1$. We find that the probability that the
computer is on at any time $t$ is

$$P(t) = \frac{\mu}{N\lambda + \mu} + \frac{N\lambda}{N\lambda + \mu}\, e^{-(N\lambda + \mu)t}. \tag{30}$$

$P(t)$ is usually called the "availability" (or sometimes,
the "readiness"). Its limiting value is

$$P_\infty = \lim_{t \to \infty} P(t) = \frac{\mu}{N\lambda + \mu}.$$

Figure A-12 shows $P_\infty$ for selected values of $N\lambda$ and $\mu$.
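As a numerical illustration of Eqs. (28) and (30), the short Python sketch below evaluates the closed-form availability and, as a cross-check, iterates the first-order difference equation directly. The function names and the sample values of $N\lambda$, $\mu$, and $\Delta$ are assumptions made for this example only.

```python
import math

def availability(t, n_lambda, mu):
    """Closed-form availability, Eq. (30), for the initial condition P(0) = 1."""
    total = n_lambda + mu
    return mu / total + (n_lambda / total) * math.exp(-total * t)

def availability_by_steps(t_end, n_lambda, mu, delta):
    """Iterate the first-order difference equation, Eq. (28), from P(0) = 1."""
    p = 1.0
    for _ in range(round(t_end / delta)):
        # Stay on with probability 1 - N*lambda*delta, or be repaired
        # with probability mu*delta if currently off.
        p = p * (1.0 - n_lambda * delta) + (1.0 - p) * mu * delta
    return p

if __name__ == "__main__":
    # Illustrative values only: N*lambda = 0.1 per hr, mu = 1 per hr.
    print(availability(5.0, 0.1, 1.0))
    print(availability_by_steps(5.0, 0.1, 1.0, 0.001))
```

The two results agree more closely as $\Delta$ is reduced, which is the sense in which Eq. (28) passes to Eq. (29).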

Fig. A-12 -- Asymptotic availability of non-redundant computer
(exponential failure and service)

7. REDUNDANT COMPUTER--EXPONENTIAL SERVICE AND FAILURE
DISTRIBUTIONS

If redundancy is now introduced into the computer at
some subsystem level, and a constant service rate is maintained,
the problem becomes considerably more complex.
In the following analysis, the case of perfect voting will
always be assumed.

Let $P_s(t)$ be the probability that a single redundant
subsystem is on at time $t$. Next define two random variables:
$T_f$, the length of time the subsystem is on before
its first failure starting at $t = 0$; and $T_s$, the length of
time the subsystem is off immediately following $T_f$. As
usual, $\Pr[T_f \le t] = F(t)$ and $\Pr[T_s \le t] = G(t)$. Now define
a new random variable $T = T_f + T_s$, where $T$ is the time at
which the subsystem is on again after its first failure,
and let $\Pr[T \le t] = H(t)$. We know that

$$P_s(t) = \int_0^\infty P_s(t \mid T = u)\, dH(u). \tag{31}$$

If $u \le t$,

$$P_s(t \mid T = u) = P_s(t - u). \tag{32}$$

Equation (32) is valid only if the entire subsystem
is replaced at $t = u$, thus ensuring a renewal process, or
if the failure process is a Markov process (a class of
stochastic processes of which the Poisson processes are a
subset), i.e., the future state of each subsystem depends
only on the present state and not on the past.

If $u > t$,

$$P_s(t \mid T = u) = P_s(t < T_f \mid T = u) \tag{33}$$

and

$$\int_t^\infty P_s(t \mid T = u)\, dH(u) = \int_t^\infty P_s(t < T_f \mid T = u)\, dH(u) = P_s(t < T_f) = 1 - F(t). \tag{34}$$

Substituting Eqs. (32) and (34) into Eq. (31) gives the
following integral equation:

$$P_s(t) = 1 - F(t) + \int_0^t P_s(t - u)\, dH(u). \tag{35}$$

Assuming that $T_f$, $T_s$, and $T$ have density functions $f$, $g$,
and $h$ respectively, and if $T_f$ and $T_s$ are independent, then

$$h(\tau) = \int_0^\tau f(\tau - v)\, g(v)\, dv \tag{36}$$

and

$$P_s(t) = 1 - F(t) + \int_0^t P_s(t - u)\, h(u)\, du. \tag{37}$$

Equation (37) is a Volterra integral equation of the second
kind and its solution will provide the desired $P_s(t)$.

One method is to take the Laplace transform of Eq. (37):†

$$P_s^*(s) = \frac{1}{s} - \frac{f^*(s)}{s} + P_s^*(s)\, f^*(s)\, g^*(s). \tag{38}$$

Rearranging gives the desired transform of $P_s(t)$:

$$P_s^*(s) = \frac{1 - f^*(s)}{s\,[1 - f^*(s)\, g^*(s)]}. \tag{39}$$

†The one-sided Laplace transform, $f^*(s)$, of $f(t)$, is
defined as $f^*(s) = \int_0^\infty e^{-st} f(t)\, dt$. Operations such as taking
the transform of an integral or of a convolution are performed
without explanation and the reader should consult
any text on the subject (such as Gardner and Barnes [7]).


So far, our attention has been directed toward obtaining
the entire transient solution to the availability
problem, namely $P_s(t)$.† Success in this endeavor depends
on a) the ability to obtain the transforms $f^*(s)$ and $g^*(s)$,
and b) the inversion of $P_s^*(s)$. Only in a few relatively
simple examples does it appear possible to get the solution
by the method of Laplace transforms. Such an example
will be given next. Theorems do exist which relate $P_\infty$
directly to $f(T_f)$ and $g(T_s)$ and will be used when required.

†Actually, we want $[P_s(t)]^M$, assuming $M$ independent
and identical subsystems.

To continue with the case at hand, from Eq. (19) (with
a slight shift in notation), we have

$$\bar{F}(t) = 1 - F(t) = \sum_{k=0}^{m} \frac{(2m+1)!}{k!\,(2m+1-k)!} \left[1 - e^{-N\lambda t/M}\right]^{k} \left[e^{-N\lambda t/M}\right]^{2m+1-k} \tag{40}$$

and

$$G(t) = 1 - e^{-\mu t}. \tag{41}$$

Specializing to the case $m = 1$ (three-fold redundancy) and
evaluating Eq. (40) gives

$$F(t) = 1 - 3e^{-2\eta t} + 2e^{-3\eta t}, \qquad \eta = \frac{N\lambda}{M}. \tag{42}$$

Differentiating Eqs. (42) and (41) results in

$$f(t) = 6\eta\left(e^{-2\eta t} - e^{-3\eta t}\right) \tag{43}$$

and

$$g(t) = \mu e^{-\mu t}, \tag{44}$$

respectively. These densities have the following transforms:

$$f^*(s) = \frac{6\eta^2}{(s + 2\eta)(s + 3\eta)} \tag{45}$$

and

$$g^*(s) = \frac{\mu}{s + \mu}. \tag{46}$$

Substituting Eqs. (45) and (46) into (39), we have

$$P_s^*(s) = \frac{(s + 5\eta)(s + \mu)}{s\left[s^2 + (5\eta + \mu)s + 6\eta^2 + 5\mu\eta\right]}. \tag{47}$$

The inverse, $P_s(t)$, may be found by first evaluating the
poles of $P_s^*(s)$, then using transform 1.111 of Ref. 7 if the
poles are real or transform 5.2-(a) of Ref. 8 if the poles
are complex. This work will not be shown here, but the
results are shown for $M = 1$ and selected values of $N\lambda$
(including $10^{-3}$ and $10^{-1}$) in Figs. A-13 to A-15.
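The inversion of Eq. (47) can also be sketched numerically. The Python fragment below is not the table-lookup procedure of Refs. 7 and 8; it is an illustrative alternative that sums the residues of Eq. (47) at its three poles, using complex arithmetic so that real and complex pole pairs are handled alike. The function name is invented for this example, and the sketch assumes the two nonzero poles are distinct.

```python
import cmath

def transient_availability_m1(t, eta, mu):
    """Invert Eq. (47) by residues (m = 1, one subsystem, P_s(0) = 1).

    Assumes the quadratic factor has two distinct roots; a repeated
    root would need a separate formula.
    """
    b = 5.0 * eta + mu
    c = 6.0 * eta ** 2 + 5.0 * mu * eta
    root = cmath.sqrt(b * b - 4.0 * c)
    s1, s2 = (-b + root) / 2.0, (-b - root) / 2.0

    def numerator(s):
        return (s + 5.0 * eta) * (s + mu)

    value = numerator(0.0) / c  # residue at s = 0 (the asymptote)
    value += numerator(s1) / (s1 * (s1 - s2)) * cmath.exp(s1 * t)
    value += numerator(s2) / (s2 * (s2 - s1)) * cmath.exp(s2 * t)
    return value.real  # imaginary parts cancel for conjugate poles

if __name__ == "__main__":
    # Illustrative values only: eta = 0.01 per hr, mu = 0.05 per hr.
    for hours in (0.0, 10.0, 100.0, 1000.0):
        print(hours, transient_availability_m1(hours, 0.01, 0.05))
```

For large $t$ the exponential terms vanish and the result tends to $5\mu/(6\eta + 5\mu)$, the residue at $s = 0$.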

Fig. A-14 -- Availability of redundant computer
(transient phase, exponential service)

Proceeding next to the asymptotic situation, we wish
to show that if $T_f$ and $T_s$ are independent then

$$\lim_{t \to \infty} P_s(t) = \frac{E[T_f]}{E[T_f] + E[T_s]}. \tag{48}$$

The final value theorem states that if $sP_s^*(s)$ is analytic
on the axis of imaginaries and in the right half-plane [7],
then

$$P_\infty = \lim_{t \to \infty} P_s(t) = \lim_{s \to 0} sP_s^*(s). \tag{49}$$

Therefore, from Eq. (39),

$$P_\infty = \lim_{s \to 0} \frac{1 - f^*(s)}{1 - f^*(s)\, g^*(s)}, \tag{50}$$

where

$$f^*(s) = \int_0^\infty e^{-st} f(t)\, dt \quad \text{and} \quad g^*(s) = \int_0^\infty e^{-st} g(t)\, dt. \tag{51}$$

Substituting Eq. (51) into (50) and taking the limit yields

$$P_\infty = \frac{\int_0^\infty t f(t)\, dt}{\int_0^\infty t f(t)\, dt + \int_0^\infty t g(t)\, dt} = \frac{E[T_f]}{E[T_f] + E[T_s]}. \tag{52}$$
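The limit in Eq. (50) is of the indeterminate form 0/0, since $f^*(0) = g^*(0) = 1$ for proper densities. A short sketch of the intermediate step, assuming that both first moments exist so that L'Hôpital's rule may be applied:

```latex
\begin{align*}
\left.\frac{df^*}{ds}\right|_{s=0} &= -\int_0^\infty t f(t)\,dt = -E[T_f],
\qquad
\left.\frac{dg^*}{ds}\right|_{s=0} = -E[T_s], \\
P_\infty &= \lim_{s \to 0}
  \frac{-\,df^*/ds}{-\left(g^*\,df^*/ds + f^*\,dg^*/ds\right)}
  = \frac{E[T_f]}{E[T_f] + E[T_s]} .
\end{align*}
```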


For the non-redundant case of Eq. (30), for which
$f(t) = N\lambda \exp(-N\lambda t)$ and $g(t) = \mu \exp(-\mu t)$, we get the result
$P_\infty = \mu/(N\lambda + \mu)$, in agreement with the limit of Eq. (30).
For the redundant case at hand, $m = 1$, $E[T_f] = 5M/6N\lambda$, and
$E[T_s] = 1/\mu$. Hence

$$P_\infty = \frac{5M\mu}{5M\mu + 6N\lambda}. \tag{53}$$

Then $P_\infty^M$, the asymptotic availability of the entire computer,
is shown for various values of $N\lambda$, $M$, and $\mu$ in
Figs. A-16 to A-19.

Fig. A-17 -- Asymptotic availability of redundant computer
(exponential service)

Fig. A-19 -- Asymptotic availability of redundant computer
(exponential service)

Next, let $m = 2$ (five-fold redundancy). The formulation
is exactly the same as for $m = 1$, except that evaluation of
Eq. (40) gives

$$F(t) = 1 - 10e^{-3\eta t} + 15e^{-4\eta t} - 6e^{-5\eta t}; \tag{54}$$

differentiating,

$$f(t) = 30\eta\left(e^{-3\eta t} - 2e^{-4\eta t} + e^{-5\eta t}\right), \tag{55}$$

and, as before, $g(t) = \mu \exp(-\mu t)$.

For $m = 2$, the Laplace transform method will, in
theory, work, but the effort does not appear warranted;
only the asymptotic case will be examined. From Eq. (55),
we find $E[T_f] = 47/60\eta$. Then, from Eq. (52),

$$P_\infty = \frac{47M\mu}{47M\mu + 60N\lambda}. \tag{56}$$

Figures A-20 to A-23 show $P_\infty^M$ for $m = 2$.
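The coefficients 5/6 and 47/60 in $E[T_f]$ can be checked mechanically. The sketch below integrates the survival function of Eq. (40) term by term in exact rational arithmetic and then applies Eq. (52). The function names are assumptions of this example, not part of the original analysis.

```python
from fractions import Fraction
from math import comb

def mean_life_coefficient(m):
    """Return c such that E[T_f] = c / eta for (2m+1)-fold majority redundancy.

    With r = exp(-eta*t), each term of Eq. (40) is expanded in powers of r,
    and the integral of r**p over t from 0 to infinity is 1 / (p * eta).
    """
    n = 2 * m + 1
    total = Fraction(0)
    for k in range(m + 1):          # k failed units, still a working majority
        for j in range(k + 1):      # binomial expansion of (1 - r)**k
            power = n - k + j       # power of r in this term (always >= 1)
            total += Fraction(comb(n, k) * comb(k, j) * (-1) ** j, power)
    return total

def asymptotic_availability(m, n_lambda, M, mu):
    """Eq. (52) for one subsystem, with E[T_s] = 1/mu and eta = n_lambda / M."""
    eta = n_lambda / M
    mean_up = float(mean_life_coefficient(m)) / eta
    return mean_up / (mean_up + 1.0 / mu)

if __name__ == "__main__":
    print(mean_life_coefficient(1), mean_life_coefficient(2))
```

Raising the subsystem value to the power $M$ gives the availability of the entire computer, as plotted in the figures.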

Fig. A-23 -- Asymptotic availability of redundant computer
(exponential service)

Before leaving the subject of transient solutions,
presenting a numerical method for computing $P(t)$ is worthwhile,
in case no analytical procedure can be found. This
method has pitfalls: the user should be warned that the
choice of a time step, $\Delta$, which might be required to yield
a sufficiently accurate (or even convergent) solution may
be exceedingly small for certain choices of $N\lambda$, $M$, and $\mu$.
No guide in making this choice can be given, so a certain
amount of trial and error should be expected.†

†For example, in the transient solution of $m = 1$,
$\eta = 0.01$, $M = 1$, and $\mu = 0.05$, a choice of $\Delta = 0.4$ hr gave
three-significant-figure accuracy. This can be much worse for
smaller values of $\eta$.

Start again with the Volterra equation (Eq. (37)):

$$P(t) = 1 - F(t) + \int_0^t P(u)\, h(t - u)\, du, \tag{57}$$

where

$$F(t) = \int_0^t f(u)\, du \tag{58}$$

and

$$h(t - u) = \int_0^{t-u} f(t - u - v)\, g(v)\, dv. \tag{59}$$

Define the discrete variables $P_k = P(k\Delta)$, $h_i = h[(k-j)\Delta]$,
where $i = k - j$ and $k, j = 1, 2, \ldots$. Assume $F(t)$ has already been
evaluated. Then performing the integration in Eq. (57) by
the trapezoidal rule gives

$$P_k = 1 - F_k + \frac{\Delta}{2} \sum_{j=0}^{k-1} \left(P_j h_{k-j} + P_{j+1} h_{k-j-1}\right)
      = 1 - F_k + \frac{\Delta}{2} P_0 h_k + \Delta \sum_{j=1}^{k-1} P_j h_{k-j} + \frac{\Delta}{2} P_k h_0. \tag{60}$$

Solving for $P_k$ gives

$$P_k = \frac{2 - 2F_k + \Delta P_0 h_k + 2\Delta \sum_{j=1}^{k-1} P_j h_{k-j}}{2 - \Delta h_0}, \qquad j < k, \quad k = 1, 2, \ldots \tag{61}$$

From Eq. (59), $h_0 = f_0 g_0$, and, again carrying out the
integration,

$$h_{k-j} = \frac{\Delta}{2} \sum_{i=0}^{k-j-1} \left(f_{k-j-i}\, g_i + f_{k-j-i-1}\, g_{i+1}\right)
        = \frac{\Delta}{2} f_{k-j}\, g_0 + \Delta \sum_{i=1}^{k-j-1} f_{k-j-i}\, g_i + \frac{\Delta}{2} f_0\, g_{k-j}. \tag{62}$$

Combining Eqs. (61) and (62) gives

$$P_k = \frac{2 - 2F_k + \Delta P_0 h_k + \Delta^2 \sum_{j=1}^{k-1} P_j \left(f_{k-j}\, g_0 + 2\sum_{i=1}^{k-j-1} f_{k-j-i}\, g_i + f_0\, g_{k-j}\right)}{2 - \Delta h_0}. \tag{63}$$

$P_0$ is the initial condition, the probability that the machine
is on at $t = 0$, and is presumed known.
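The recursion of Eqs. (61) and (62) can be sketched in a few lines of Python. This is an illustrative sketch rather than the program used for the figures: the function name and argument order are invented here, the densities $f$ and $g$ and the distribution $F$ are supplied as callables, and the sketch takes $h_0 = 0$ (the value of the convolution integral of Eq. (59) over an interval of zero length), which is an assumption of this example.

```python
import math

def transient_availability(f, g, F, dt, steps, p0=1.0):
    """Solve P(t) = 1 - F(t) + int_0^t P(u) h(t-u) du on a grid of spacing dt.

    Returns the list [P_0, P_1, ..., P_steps] using the trapezoidal rule.
    """
    fs = [f(i * dt) for i in range(steps + 1)]
    gs = [g(i * dt) for i in range(steps + 1)]

    # h[n] approximates the convolution of f and g at lag n*dt (Eq. 62).
    h = [0.0] * (steps + 1)  # h[0] = 0: assumption of this sketch
    for n in range(1, steps + 1):
        inner = sum(fs[n - i] * gs[i] for i in range(1, n))
        h[n] = dt * (0.5 * fs[n] * gs[0] + inner + 0.5 * fs[0] * gs[n])

    # March forward with Eq. (61); P_k appears on both sides via h[0].
    p = [p0]
    for k in range(1, steps + 1):
        acc = sum(p[j] * h[k - j] for j in range(1, k))
        num = 2.0 - 2.0 * F(k * dt) + dt * p[0] * h[k] + 2.0 * dt * acc
        p.append(num / (2.0 - dt * h[0]))
    return p

if __name__ == "__main__":
    # Non-redundant exponential case, illustrative values only.
    n_lambda, mu = 0.1, 1.0
    grid = transient_availability(
        lambda t: n_lambda * math.exp(-n_lambda * t),
        lambda t: mu * math.exp(-mu * t),
        lambda t: 1.0 - math.exp(-n_lambda * t),
        dt=0.02, steps=250)
    print(grid[-1])
```

For the non-redundant exponential case the output can be compared with Eq. (30), which gives a direct way to judge whether a chosen $\Delta$ is small enough before the method is applied to a case with no closed form.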

8. NON-REDUNDANT COMPUTER--EXPONENTIAL SERVICE AND
WEIBULL FAILURE DISTRIBUTIONS

Section III-D discussed the possible use of non-constant
failure rates--in particular, the difficulties of testing
and accepting the hypothesis that a particular component
has, indeed, a failure rate which decreases with time.
Setting these difficulties aside for now, we show the consequences
of assuming such a decreasing failure rate model.

To date, the most successful distribution (in the sense
that it fits the experimental data) with non-constant failure
rate has been the Weibull distribution; the link, however,
between real physical mechanisms and this distribution
is still very tenuous, at best.† In any event, the Weibull
distribution has density function

$$f(t) = N\lambda\alpha\, t^{\alpha - 1} \exp(-N\lambda t^{\alpha}), \qquad \alpha > 0, \tag{64}$$

which, for $\alpha = 1$, reduces to the exponential distribution.
Choosing $\alpha$ so that $0 < \alpha < 1$ is what determines a decreasing
failure rate, and its actual value must be determined
by experiment. At this point, the ability to obtain
transient results without recourse to strictly numerical
methods appears very doubtful. Figure A-24 shows a comparison
of the transient phase of an exponential failure
machine with one having a Weibull failure distribution
with $\alpha = 0.5$ by using the numerical method described in
Sec. A-7. These are non-redundant serviced machines with
$N\lambda = 0.1$ and $\mu = 1$.

†Cox [11], pp. 109-110, gives a very interesting relationship
between the Weibull and exponential distributions--one
with perhaps some relationship to reality.

Fig. A-24 -- Comparison of transient availability, exponential and Weibull
failure distributions (non-redundant computer, exponential service)

The asymptotic solution proceeds quite easily. First,
the expectation of $T_f$ is required:

$$E[T_f] = \int_0^\infty t f(t)\, dt = N\lambda\alpha \int_0^\infty t^{\alpha} \exp(-N\lambda t^{\alpha})\, dt. \tag{65}$$

Let $u = N\lambda t^{\alpha}$ to get

$$E[T_f] = \left(\frac{1}{N\lambda}\right)^{1/\alpha} \int_0^\infty u^{1/\alpha} e^{-u}\, du = \left(\frac{1}{N\lambda}\right)^{1/\alpha} \Gamma(1 + 1/\alpha). \tag{66}$$

Then, assuming as usual that $g(t) = \mu \exp(-\mu t)$,

$$P_\infty = \frac{\mu\,\Gamma(1 + 1/\alpha)}{\mu\,\Gamma(1 + 1/\alpha) + (N\lambda)^{1/\alpha}}. \tag{67}$$

For the usual values of $N\lambda$ and $\mu$, a comparison between
the exponential and Weibull ($\alpha = 0.5$) failure models is
shown in Fig. A-25.
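Equation (67) is easily evaluated numerically. The following Python sketch does so with the standard-library gamma function; the function name and the sample parameter values are assumptions of this example.

```python
import math

def weibull_asymptotic_availability(n_lambda, mu, alpha):
    """Eq. (67): Weibull failures (shape alpha), exponential service (rate mu)."""
    gamma_term = math.gamma(1.0 + 1.0 / alpha)
    return mu * gamma_term / (mu * gamma_term + n_lambda ** (1.0 / alpha))

if __name__ == "__main__":
    # Illustrative values only: N*lambda = 0.1, mu = 1.
    print(weibull_asymptotic_availability(0.1, 1.0, 1.0))   # exponential case
    print(weibull_asymptotic_availability(0.1, 1.0, 0.5))   # decreasing failure rate
```

With $N\lambda = 0.1$ and $\mu = 1$, the exponential case ($\alpha = 1$) gives about 0.909 and the Weibull case with $\alpha = 0.5$ gives about 0.995, since $\Gamma(3) = 2$.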


Fig. A-25 -- Asymptotic availability of non-redundant computer, exponential
and Weibull failure distributions (exponential service)

9. MULTIPLE COMPUTERS

The next topic is the availability of multiple non-redundant
computers. The benefits to be derived from the
use of more than one computer, where the extras are treated
as spares and profitably used only in the event of a
machine failure, have been discussed in Chap. III; only
the analysis appears here.

10. SINGLE COMPUTER--EXPONENTIAL SERVICE AND FAILURE
DISTRIBUTIONS

This is exactly the case already discussed in Sec.
A-6, but the results are repeated here for easy reference.
The solution was obtained by formulating the difference
equation

$$P_c(t+\Delta) = P_c(t)(1 - N\lambda\Delta) + [1 - P_c(t)]\,\mu\Delta + o(\Delta), \tag{68}$$

passing to the limit to get the differential equation

$$P_c'(t) + (N\lambda + \mu)P_c(t) = \mu, \tag{69}$$

and solving:

$$P_c(t) = \frac{\mu}{N\lambda + \mu} + \frac{N\lambda}{N\lambda + \mu}\, e^{-(N\lambda + \mu)t}. \tag{70}$$


11. DUPLEX COMPUTERS WITH FLEXIBLE SERVICE--EXPONENTIAL
SERVICE AND FAILURE DISTRIBUTIONS

When discussing duplex systems, we assume that the
system has adequate error checking. Either each machine
can check itself or the machines work together to ascertain
which is in error.

The simpler of the two service situations has flexible
service, so that each machine is independent of the other
and the probability that the system has at least one machine
operating is easily written as

$$P(t) = 1 - [1 - P_c(t)]^2, \tag{71}$$

where $P_c(t)$, the availability of a single machine system,
is given by Eq. (70). In the limit, the asymptotic probability
is

$$P_\infty = 1 - \left(\frac{N\lambda}{N\lambda + \mu}\right)^2. \tag{72}$$

A graph of $P_\infty$ is given in Fig. A-26. This should be compared
with Fig. A-12, where the asymptotic value of Eq.
(70) is shown.

12. DUPLEX COMPUTERS WITH LIMITED SERVICE--EXPONENTIAL
SERVICE AND FAILURE DISTRIBUTIONS

"Limited service" means that both computers cannot
be serviced at the same time. If both fail, one must wait
for service until the other is fixed. This introduces a
dependence between the two renewal processes which complicates
the problem.

Following the derivation in Sec. A-6, and assuming
two machines, Computers A and B, $P_a(t)$ is the probability
that Computer A is on at time $t$. $P_b(t)$ has a similar
definition. Let $\alpha$ be the probability that Computer A is
selected for service if both computers have failed. Then
Computer B is selected, given that both have failed, with
probability $1 - \alpha$.

The difference equations now take the form

$$P_a(t+\Delta) = P_a(t)(1 - \lambda\Delta) + [1 - P_a(t)]\,P_b(t)\,\mu\Delta + [1 - P_a(t)][1 - P_b(t)]\,\alpha\mu\Delta \tag{73}$$

and

$$P_b(t+\Delta) = P_b(t)(1 - \lambda\Delta) + [1 - P_b(t)]\,P_a(t)\,\mu\Delta + [1 - P_a(t)][1 - P_b(t)]\,(1 - \alpha)\mu\Delta. \tag{74}$$

Rearrange and take the limit to get

$$P_a' + (\lambda + \alpha\mu)P_a - \mu(1 - \alpha)P_b - \mu[\alpha - (1 - \alpha)P_a P_b] = 0 \tag{75}$$

and

$$P_b' + [\lambda + (1 - \alpha)\mu]P_b - \mu\alpha P_a - \mu(1 - \alpha - \alpha P_a P_b) = 0. \tag{76}$$

The problem becomes much more tractable if $\alpha = 1/2$, there
being no preference as to which computer is serviced first.
Equations (75) and (76) become

$$P_a' + \left(\lambda + \frac{\mu}{2}\right)P_a - \frac{\mu}{2}P_b - \frac{\mu}{2}(1 - P_a P_b) = 0 \tag{77}$$

and

$$P_b' + \left(\lambda + \frac{\mu}{2}\right)P_b - \frac{\mu}{2}P_a - \frac{\mu}{2}(1 - P_a P_b) = 0. \tag{78}$$

The symmetry between $P_a$ and $P_b$ now allows the distinction
between subscripts to be dropped and we have only one
differential equation to solve, in, say, the variable $P_e$:

$$P_e' + \left(\lambda + \frac{\mu}{2}\right)P_e - \frac{\mu}{2}P_e - \frac{\mu}{2}(1 - P_e^2) = 0, \tag{79}$$

or, rearranging,

$$P_e' = -\frac{\mu}{2}P_e^2 - \lambda P_e + \frac{\mu}{2}, \tag{80}$$

which is a form of the Riccati equation. The solution of
this equation is possible because the coefficients of the
quadratic form are constants and the method may be found
in most texts on differential equations [9]. Only the
solution is given here. Define

$$P_1 = \frac{-\lambda + \sqrt{\lambda^2 + \mu^2}}{\mu} \tag{81}$$

and

$$P_2 = \frac{-\lambda - \sqrt{\lambda^2 + \mu^2}}{\mu}. \tag{82}$$

Then the general solution, $P_e(t)$, is

$$\frac{P_e(t) - P_1}{P_e(t) - P_2} = K \exp\left[-\frac{\mu}{2}(P_1 - P_2)t\right] \tag{83}$$

and, if $P_e(0) = 1$,

$$P_e(t) = \frac{P_1(1 - P_2) - P_2(1 - P_1)\exp\left[-\frac{\mu}{2}(P_1 - P_2)t\right]}{(1 - P_2) - (1 - P_1)\exp\left[-\frac{\mu}{2}(P_1 - P_2)t\right]}. \tag{84}$$

$P_e(t)$ is the probability that either of the two computers
is on at time $t$. Hence, the probability that the
duplex system is operative at $t$ is again

$$P(t) = 1 - [1 - P_e(t)]^2. \tag{85}$$

In the limit, we get $P_\infty = 2P_1 - P_1^2$ as the asymptotic
probability. This probability is shown in Fig. A-27, and
should be compared to Figs. A-12 and A-26.

Fig. A-27 -- Asymptotic availability of duplex computers
(limited service)
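The limited-service solution can be checked numerically. The sketch below takes $P_1$ and $P_2$ to be the two roots of the quadratic on the right-hand side of Eq. (80), evaluates the closed form of Eq. (84), and compares it with a simple forward-Euler integration of Eq. (80). Function names, step count, and sample values are assumptions of this example.

```python
import math

def riccati_roots(lam, mu):
    """Roots P1 > 0 > P2 of -(mu/2) P**2 - lam P + mu/2 = 0 (from Eq. 80)."""
    disc = math.sqrt(lam * lam + mu * mu)
    return (-lam + disc) / mu, (-lam - disc) / mu

def single_machine_availability(t, lam, mu):
    """Closed-form solution, Eq. (84), for P_e(0) = 1."""
    p1, p2 = riccati_roots(lam, mu)
    decay = math.exp(-0.5 * mu * (p1 - p2) * t)
    return (p1 * (1 - p2) - p2 * (1 - p1) * decay) / ((1 - p2) - (1 - p1) * decay)

def duplex_availability(t, lam, mu):
    """Eq. (85): at least one of the two machines is on."""
    pe = single_machine_availability(t, lam, mu)
    return 1.0 - (1.0 - pe) ** 2

def integrate_riccati(t_end, lam, mu, steps=20000):
    """Forward-Euler integration of Eq. (80) from P_e(0) = 1, as a cross-check."""
    dt, p = t_end / steps, 1.0
    for _ in range(steps):
        p += dt * (-0.5 * mu * p * p - lam * p + 0.5 * mu)
    return p

if __name__ == "__main__":
    # Illustrative values only: lambda = 0.1, mu = 1.
    print(single_machine_availability(3.0, 0.1, 1.0), integrate_riccati(3.0, 0.1, 1.0))
    print(duplex_availability(1000.0, 0.1, 1.0))
```

For $\lambda = 0.1$ and $\mu = 1$ the asymptotic single-machine value is $P_1 \approx 0.905$, slightly below the flexible-service value $\mu/(\lambda + \mu) \approx 0.909$, as one would expect when a failed machine may have to wait for service.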
13. TRIPLEX COMPUTERS WITH FLEXIBLE SERVICE--EXPONENTIAL
SERVICE AND FAILURE DISTRIBUTIONS


A system made more reliable by adding a third computer
can be operated in two ways. First, adequate error checking
is assumed, and therefore the system fails only when all
three machines fail. In this case, $P(t) = 1 - [1 - P_c(t)]^3$,
where $P_c(t)$ is given by Eq. (70).

If the three machines do not have enough self-checking,
their outputs should be majority voted. With this method
we have

$$P(t) = \sum_{k=2}^{3} \binom{3}{k} P_c^k(t)\,[1 - P_c(t)]^{3-k} = 3P_c^2(t) - 2P_c^3(t). \tag{86}$$
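The duplex and triplex formulas differ only in how the single-machine availability $P_c$ is combined, so a side-by-side comparison is straightforward. The Python sketch below uses function names invented for this example and an illustrative value of $P_c$.

```python
def duplex_flexible(pc):
    """Eq. (71): the system is up unless both machines are down."""
    return 1.0 - (1.0 - pc) ** 2

def triplex_error_checked(pc):
    """Adequate error checking: the system is up unless all three are down."""
    return 1.0 - (1.0 - pc) ** 3

def triplex_majority_voted(pc):
    """Majority voting: at least two of the three machines must be up."""
    return 3.0 * pc ** 2 - 2.0 * pc ** 3

if __name__ == "__main__":
    pc = 0.9  # illustrative single-machine availability
    print(duplex_flexible(pc), triplex_error_checked(pc), triplex_majority_voted(pc))
```

For $P_c = 0.9$ the three values are 0.99, 0.999, and 0.972; the majority-voted triplex is less available than the error-checked duplex because it needs two working machines rather than one.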


14. THE MULTI-PROCESSOR SYSTEM--EXPONENTIAL SERVICE AND
FAILURE DISTRIBUTIONS

Finally, the reliability of the so-called multi-processor
must be discussed. Again, the theory underlying
the multi-processor structure is given in Sec. IV-2 and
only the probabilistic analysis will be done here. The
system is shown schematically in Fig. A-28, and assumes
the simplification that all the switching circuits are
perfectly reliable. If this is not the case, its effect
can probably be absorbed into the failure properties of
the individual units. Furthermore, all the individual
units are assumed to be identical. Clearly, a memory is
not the same as an input/output channel, but for the
purposes of this probabilistic analysis it is assumed to
be.

Fig. A-28 -- The multi-processor ($m$ computers of $M$ units each, every
column being a complete computer, plus $k$ spare computers)

The following definitions are needed:

$M$ = Number of units needed to make up a single computer,
$m$ = Number of computers needed to make up the processor,
$N$ = The total number of parts in the $M \times m$ system,
$\lambda$ = The failure rate of a single (average) part,
$\mu$ = Service rate for a single unit (flexible service),
$k$ = Number of spare units of each type.

Let $P_u(t)$ be the probability that a unit is on at $t$.
This can be found by replacing $N$, the number of parts in
one unit in Eq. (30), by $N/Mm$, the number of parts per unit
in the multi-processor case, and we get

$$P_u(t) = \frac{Mm\mu}{N\lambda + Mm\mu} + \frac{N\lambda}{N\lambda + Mm\mu} \exp\left[-\left(\frac{N\lambda}{Mm} + \mu\right)t\right]. \tag{87}$$

There are enough units of type $i$ only if at least $m$ out
of $m + k$ are operating. The probability of this event is

$$P_i(t) = \sum_{j=m}^{m+k} b[m + k,\, j,\, P_u(t)] = 1 - \sum_{j=0}^{m-1} b[m + k,\, j,\, P_u(t)]. \tag{88}$$

Finally, the multi-processor system is on only if all
the unit types have at least $m$ of $m + k$ on, and since all
the unit types are taken to be identical and independent,

$$P(t) = \prod_{i=1}^{M} P_i(t) = [P_i(t)]^M. \tag{89}$$

Let $P_\infty = \lim_{t \to \infty} P(t)$; i.e., the value of $P(t)$ when
$P_u = Mm\mu/(N\lambda + Mm\mu)$. Then $P_\infty$, the asymptotic availability
for the entire processor, is shown for various values of
$N\lambda$, $M$, $m$, $\mu$, and $k$ in Figs. A-29 to A-40.
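The asymptotic multi-processor availability can be computed directly from Eqs. (87) to (89). The sketch below assumes that $b[n, j, p]$ denotes the binomial probability of exactly $j$ units on out of $n$, each on with probability $p$; that reading, the function names, and the sample values are assumptions of this example.

```python
from math import comb

def unit_availability(n_lambda, mu, M, m):
    """Asymptotic value of Eq. (87) for a single unit."""
    return M * m * mu / (n_lambda + M * m * mu)

def type_availability(p_unit, m, k):
    """Eq. (88): at least m of the m + k units of one type are on."""
    n = m + k
    return sum(comb(n, j) * p_unit ** j * (1.0 - p_unit) ** (n - j)
               for j in range(m, n + 1))

def processor_availability(n_lambda, mu, M, m, k):
    """Eq. (89): every one of the M unit types must have at least m units on."""
    p_unit = unit_availability(n_lambda, mu, M, m)
    return type_availability(p_unit, m, k) ** M

if __name__ == "__main__":
    # Illustrative values only: N*lambda = 0.1, mu = 1, M = 2, m = 2.
    for spares in (0, 1, 2):
        print(spares, processor_availability(0.1, 1.0, 2, 2, spares))
```

With $k = 0$ the expression reduces to $P_u^{mM}$, and each added spare raises the availability, which is the behavior the figures are meant to display.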


Figs. A-29, A-37, A-38, and A-39 -- Asymptotic availability of
multi-processor (exponential service)

Appendix B

SEMICONDUCTOR DEVICES--BEHAVIOR VS. STRESS

1. INTRODUCTION

Failure  modes  and  mechanisms,  and  the  effects  of 
stresses  can  be  considered  simultaneously  for  three  types 
of  semiconductor  devices:  transistors,  diodes,  and  inte¬ 
grated  circuits. 

The  general  fabrication  process  for  planar/mesa/ 
diffused  devices  will  be  briefly  discussed.  The  first 
step  is  the  growth  of  a  large  single  crystal  of  silicon 
with  controlled  amounts  of  impurities  added.  The  silicon 
is  sliced  into  wafers  which  are  processed  by  cleaning, 
polishing,  and  etching  operations. 

After  a  sequence  of  photolithographic,  chemical 
diffusion,  and  vacuum  deposition  processes,  each  wafer 
contains  many  (perhaps  several  hundred)  identical,  selec¬ 
tively  altered  regions,  each  of  which  will  perform  the 
function  of  a  diode,  transistor,  or  more  complex  combina¬ 
tion  (integrated  circuit). 

The  wafer  is  cut  into  individual  device  areas  (dice), 
which  are  affixed  to  mounting  bases  or  headers;  leads  are 
attached  by  some  process  such  as  thermocompression  bond¬ 
ing,  and  a  suitable  enclosure  (glass  tube,  can,  flatpack) 
is  provided. 

The significant stresses affecting semiconductor
failure are temperature, voltage, current, and mechanical
stress. Temperature directly accelerates all destructive
physico-chemical processes and indirectly produces mechanical
effects by expansion and contraction. Direct mechanical
effects (shock, vibration) are probably negligible, although
20,000 G centrifuging is a useful screening test
for certain manufacturing and bulk defects.

Voltage  directly  produces  internal  fields  and 
gradients.  The  failure  rate  is  usually  quite  nonlinear 
with  voltage.  Suitable  derating  and  protection  against 
accidental  overload  can  essentially  eliminate  voltage  as 
a  direct  stress.  Voltage  and  current,  simultaneously 
applied,  result  in  power  dissipation  which  becomes  a 
temperature  stress.  If  useful  circuit  performance  is  to 
be  obtained,  some  power  must  be  dissipated.  Derating  is, 
of  course,  effective  in  reducing  the  stress,  but  the  re¬ 
quirements  of  driving  loads,  parasitic  inductance,  and 
capacitance put a lower limit on power level. Either
voltage or current may directly produce harmful effects
in the bulk material or in the interconnections by electrolysis,
non-uniform current density, and other
mechanisms.

2.  FACTORS  CONTRIBUTING  TO  SEMICONDUCTOR  FAILURE 

The  origins  and  modes  of  semiconductor  failures  may 
be  classified  in  various  ways.  Among  these  ways  is  the 
option  of  grouping  them  according  to  the  factors  causing 
failure  which  derive  from  bulk,  contact,  and  packaging. 

Bulk Factors

Bulk factors include the following possibilities:
1) The impurity profile is originally incorrect or is
changed under stress; 2) crystallographic defects exist,
that is, dislocations or stacking faults (diffusion of
phosphorus in silicon may proceed 1000 times faster along
a grain boundary than through a perfect crystal); 3) gross
mechanical defects exist such as cracks or strains.


Contact  Factors 

When  the  passivating  oxide  layer  is  removed  to  permit 
vacuum  deposition  of  a  contact  pad,  oxide  immediately 
starts  to  regrow.  The  material  (aluminum,  usually)  and 
the  time-temperature  of  the  alloying  process  must  be  such 
that  the  contact  material  penetrates  or  reduces  the  oxide 
layer,  but  does  not  extend  into  any  undesired  regions 
either  vertically  or  horizontally. 

Bonds  to  the  contacts  and  to  the  external  leads  are 
major  sources  of  failure.  Internal  aluminum  interconnec¬ 
tions  in  integrated  circuits  show  occasional  defects. 

Packaging  Factors 

Major  factors  in  packaging  are  attachment  of  the  dies 
to  the  mounting  surface,  integrity  of  hermetic  seals,  and 
quality  of  the  final  closure  (usually  a  weld  or  glass 
fusion) . 

3. SEMICONDUCTOR FAILURES LISTED BY CAUSE

Below  is  a  rather  detailed  list  of  failure  origins, 
modes  and  mechanisms.  All  of  these  failures  were  observed 
in actual failure analysis.

Gross Manufacturing, Design, and Mechanical Failures

Faulty package closure;

External lead breakage;

Internal  pins  hit  top  of  can; 

Internal  lead  dress  causes  shorts; 

Die  off  header; 

Fractured  die; 

Cracked  die; 

Tool damage (scratch, cut, or nick on surface);
Void under die (supposedly precluded by vibrating
during attachment);

Loose foreign material (seems avoidable but is
observed);

Package  mechanical  defect  other  than  leak; 

Package  mislabeled; 

Internal  lead  connected  to  wrong  pin; 
Short--misplaced  bond; 

Short or poor performance--mask or registration
error;

Floating  junctions,  PNPN  switching. 


Contact  and  Interconnection  Failures 


Bond  off  clean; 

Bond  off--intermetallic  phase  (plague); 

Bond  off--bad  metal  surface; 

Bond  broken  at  heel; 

Bond  broken  within  bond; 

Bond  off  lifting  pad  material; 

Bond off lifting silicon and pad material;
Insufficient  pad  or  interconnect  material-- 
lack  of  metal; 

Insufficient  pad  or  interconnect  material-- 
intermetallic  phase; 

Insufficient pad or interconnect material--
aluminum oxide/hydroxide;

Weld  off  clean; 

Weld  broken  at  heel; 

Wire  broken  in  span; 

Insufficient  oxide  removal  on  window; 

Open aluminum at passivation step.

Failures Due to the Surface and Surface Environment


Surface  contamination--die; 

Surface contamination--header;

Inversion  layer  (may  be  permanent,  or  may  dis¬ 
appear  after  stress  removal,  high  temperature 
bake,  uncapping,  or  solvent  or  acid  wash); 

Accumulation  layer  (same  comment); 

Hermetic  seal  leak; 

Inclusion of corrosive contaminant (insufficient
etchant cleanup);

Inclusion of ionic contaminant (notably water,
possibly phosphorus compound) [1];

Surface  ionization — ionic  conduction; 

Passivation  layer  too  thin-- leakage  or  short; 

Local  defect  (pit)  in  passivation  layer-- leakage 
or  short; 

Aluminum  migration  through  oxide; 

Oxidation  of  ohmic  contacts; 

Diffusion  surface  drift. 


Bulk  and  Process  Failures 


Crystallographic  defects; 

Internal  contamination; 

Mask  and  registration  errors; 

Time- temperature-constituent  inadequacies-- 
wrong  impurity  profile; 

Thermal  fatigue; 

Dislocation-induced  processes. 


Gross  Externally  Induced  Failures 

Electrical  overload  in  semiconductor; 

Thermal  runaway; 

Second  breakdown; 

Melting (usually gold-silicon at 370 C eutectic
point);

Overvoltage breakdown.

Integrated Circuits and Transistors--Failure Origins,
Modes, and Mechanisms

Table B-1 presents the results of a review of life
tests for integrated circuits and transistors. Twelve
sources gave specific information as to the nature of the
failures. Table B-1 must be used with Table B-2, which is
a key to the Origin, Mode, and Mechanism number given in
Column 2 of Table B-1.


Semiconductor  Device  Degradation 

Most transistor degradation studies have been concerned
with time variation of two static parameters, $h_{FE}$
(current gain) and $I_{CBO}$ (collector diode reverse current).
Variations of these parameters under various conditions
of temperature, voltage, and power are often large compared
to the observed behavior of the passive components. This
is not necessarily cause for alarm, at least for digital
circuits, as a circuit which will operate at some minimum
gain ($h_{FE}$) will ordinarily tolerate any higher gain (to
infinity). Likewise, a circuit designed for some maximum
collector reverse current ($I_{CBO}$) will operate at any lower
value (to 0).

Life test data on 10,300 transistors of 24 types [2]
produced the $h_{FE}$ and $I_{CBO}$ variations shown in Table B-3.
Tests included high-temperature storage, operating life,
and cycled operating life, over periods of 500 to 10,000
hr, with most tests of 1000-hr duration. Positive and
negative percentage changes are shown for the two extreme
cases in each test, and log-normal percentage ranges are
used.

Table B-1

INTEGRATED CIRCUITS AND TRANSISTORS FAILURE ORIGINS,
MODES, AND MECHANISMS

Table B-2

FAILURE ORIGIN, MODE, AND MECHANISM KEY


 1. Leaky package.
 2. Broken bond.
 3. Broken bond at header pin.
 4. Broken bond--aluminum-oxide adhesion.
 5. Broken or defective bond--purple plague.
 6. Inoperative.
 7. Package mechanical defect other than leak.
 8. Overvoltage breakdown.
 9. Fabrication errors--floating junctions, PNPN switching.
10. Oxide breakdown--aluminum shorts to silicon.
11. Aluminum corrosion under power.
12. Incomplete window etch--oxide regrows.
13. Diffusion surface drift.
14. Gold-silicon eutectic melting (370 C).
15. Charge on/in passivation layer.
16. Channeling.
17. Inversion.
18. Bond off clean leaving aluminum.
19. Scratch.
20. Broken lead.
21. Cracked bar.
22. Contamination.
23. Lead fatigue/tension test.
24. Storage life test.
25. Substandard conductive film.
26. Aluminum penetration of silicon dioxide.
27. Corrosion--incomplete etchant cleanup.
28. Mask/registration error.
29. Package mislabeled.
30. Loose foreign material.
31. Internal short--excess solder.
32. Tool damage.
33. Aluminum open at passivation step.
34. Lead dress.
35. Die off header.




Table B-3

LIFE TEST DATA ON h_FE AND I_CBO VARIATION

        h_FE Variation                     I_CBO Variation

  Percentage                         Percentage
  Variation from      No. of         Variation from         No. of
  Initial Value       Cases          Initial Value          Cases

  +1031 to +1500        1            over +102,300            5
  +701 to +1030         1            +25,501 to +102,300      5
  +467 to +700          1            +6,301 to +25,500        3
  +301 to +466          0            +1,501 to +6,300         4
  +183 to +300          0            +301 to +1,500           9
  +101 to +182          6            +0 to +300              16
  +41 to +100          12            -0 to -75               24
  +0 to +40            24            -76 to -94              10
  -0 to -29            25            -95 to -99               6
  -30 to -50           12
  -51 to -64            3
  -65 to -75            1
  -76 to -82            0
  -83 to -88            0
  over -88

Note that the variations shown are from initial
values for the two most deviant transistors in each test.
The initial values are ordinarily far removed from the
initial specification limits. Thus, a transistor showing
an I_CBO increase of 6300 per cent might have gone from
one to 64 nanoamperes with a specification limit of 100
nanoamperes. Therefore, while a large variation certainly
indicates instability, it does not necessarily indicate
failure. Of the 10,300 units tested, 33 were considered
I_CBO failures, using arbitrary end-of-life limits and,
in many cases, high stress levels.
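
The arithmetic behind the 6300 per cent example can be checked directly. The short sketch below uses an illustrative function name; the one-nanoampere initial value and the 100-nanoampere limit are the figures quoted in the note.

```python
def value_after_change(initial, percent_change):
    """Return the parameter value after a percentage change from initial."""
    return initial * (1.0 + percent_change / 100.0)

initial_icbo_na = 1.0     # initial leakage, nanoamperes (from the note)
spec_limit_na = 100.0     # specification limit, nanoamperes (from the note)

final_icbo_na = value_after_change(initial_icbo_na, 6300.0)
within_spec = final_icbo_na <= spec_limit_na
print(final_icbo_na, within_spec)
```

A 6300 per cent increase is a factor of 64, so the unit remains inside the specification limit despite the large relative change.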

Appendix C

RESISTORS--BEHAVIOR VS. STRESS

1.  CARBON  COMPOSITION  RESISTORS 

Carbon  composition  resistors  consist  of  carbon  particles 
in  an  organic  binder  with  inserted  leads  and  a  non-hermetic 
covering.  They  are  the  most  common  and  economical  parts 
in the electronic industry; the price for a 1/2-watt,
5-per-cent-tolerance resistor in quantities of 10,000 is
$0.038. The behavior of carbon resistors under various
stresses  is  shown  in  Table  C-l. 

Additional  factors  affecting  composition  resistor 
behavior  are  short-time  overload  (±2.5  per  cent  per 
MIL-R-11A)  and  soldering  (±3  per  cent  per  MIL-R-11A). 

With  suitable  precautions  against  overload  and  solder¬ 
ing  in  a  humidity-controlled  environment,  these  values 
can  be  significantly  reduced.  Clearly,  humidity  and 
temperature (ambient or load-dependent) are the major
factors  influencing  composition  resistor  behavior.  If 
both  are  controlled,  the  composition  resistor  becomes  a 
satisfactorily  stable  and  extremely  reliable  part.  Worst 
case  (see  Sec.  III-2)  end-of-life  design  tolerances  for 
composition  resistors  of  ±5  per  cent  manufacturing  toler¬ 
ance  range  from  ±10  per  cent  in  moderate  environment  to 
±20  per  cent  in  severe  environments. 
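
As an illustration of what these end-of-life tolerances mean for a circuit, the sketch below computes worst-case resistance limits and the resulting spread of a two-resistor voltage divider. The divider, the 1000-ohm values, and the function names are hypothetical and chosen only for illustration; the ±10 and ±20 per cent figures are those quoted above.

```python
def resistance_limits(nominal_ohms, design_tol):
    """Worst-case (low, high) resistance for a fractional design tolerance."""
    return nominal_ohms * (1.0 - design_tol), nominal_ohms * (1.0 + design_tol)

def divider_ratio_limits(r_top, r_bottom, design_tol):
    """Worst-case (min, max) of r_bottom / (r_top + r_bottom).

    The minimum occurs with r_top high and r_bottom low; the maximum
    with r_top low and r_bottom high.
    """
    top_lo, top_hi = resistance_limits(r_top, design_tol)
    bot_lo, bot_hi = resistance_limits(r_bottom, design_tol)
    return bot_lo / (top_hi + bot_lo), bot_hi / (top_lo + bot_hi)

for tol in (0.10, 0.20):   # moderate and severe environments
    lo, hi = divider_ratio_limits(1000.0, 1000.0, tol)
    print(f"tolerance ±{tol:.0%}: ratio from {lo:.2f} to {hi:.2f}")
```

For equal resistors the nominal ratio of 0.50 spreads to 0.45-0.55 at ±10 per cent and to 0.40-0.60 at ±20 per cent.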

Reference  to  the  Allen-Bradley  load-life  nomographs 
[3]  shows  that  only  2.5  per  cent  resistance  decrease  is 
expected  for  a  1/2-watt  resistor  operated  for  ten  years 
at  22  per  cent  of  rated  wattage  at  55°C  lead  temperature; 
or  at  50  per  cent  of  rated  wattage  at  40°C  lead  temperature. 

Table C-1

CARBON COMPOSITION RESISTORS--BEHAVIOR VS. STRESS

Part                  Test or Stress            Result                    Source

Allen-Bradley type    113 hr, 55°C,             10^2 Ω  +1.8 to +5.1%     [1]
EB, 1/2 watt          95% relative              10^4 Ω  +3.2 to +6.9%
                      humidity                  10^6 Ω  +4.5 to +8.7%

As above              350 volts applied         10^3 Ω  -0 to -0.2%       [1]
                                                10^5 Ω  -0.5 to -3.0%
                                                10^6 Ω  -3.1 to -6.3%

As above              MIL-R-11A                 -0, +2%                   [1]
                      temperature cycle

As above              1 watt, 25°C,             +2, -4%                   [1]
                      113 hr

Allen-Bradley         Catastrophic              None, when operated       [1]
composition,          failures                  within ratings (!)
total production

IRC type GBT-1/2,     Resistance vs.            At -55°C:                 [1]
1/2 watt              temperature,              10^2 Ω  +2 to +6%
                      MIL-R-11,                 10^4 Ω  +6 to +9%
                      -55°C and +105°C          10^6 Ω  +7 to +11%
                                                At +105°C:
                                                10^2 Ω  -0 to -4%
                                                10^4 Ω  -1 to -4%
                                                10^6 Ω  -1 to -4%

As above              MIL-R-11 voltage          -0 to -.02%/volt          [2]
                      coefficient

As above              MIL-R-11 moisture         -1%, +6%,                 [2]
                      resistance                10^2-10^7 Ω

As above              70°C load life,           10^2 Ω  +2.5, -4%         [2]
                      1000 hr                   10^4 Ω  +1, -5%
                                                10^6 Ω  -4%

2. METAL AND CARBON FILM RESISTORS

Carbon film resistors will be excluded from consideration
as the metal film equivalents cost only slightly
more  and  perform  better.  Metal  film  resistors  consist 
of  a  tubular  ceramic  substrate  on  which  a  thin  metallic 
film,  such  as  nickel-chromium,  is  vacuum-deposited.  A 
spiral  is  cut  in  the  film  to  produce  the  final  resistance 
value,  usually  under  automatic  control.  End  caps,  leads, 
and  protective  coating  complete  the  part.  Initial 
tolerances  of  1.0  to  0.1  per  cent  are  readily  achieved, 
with  1000-hr,  125°C  load-life  degradation  well  inside 
the  ±0.5  per  cent  limits  of  the  applicable  military 
specification (MIL-R-10509E, characteristic E).

A  table  for  metal  film  resistors,  like  the  one  above 
for  composition  resistors,  would  not  be  particularly 
informative,  because  observed  behavior  over  reasonable 
test  times  does  not  show  significant  variations.  For 
example,  o5,000  IRC- type  XLT  metal  film  resistors  have 
run  for  4000  hr  at  25°C,  1/16-watt  dissipation,  with  no 
resistance  change  greater  than  0.5  per  cent  [4,5]. 

Purely  theoretical  application  of  physics-of-failure 
to  prediction  of  metal  film  resistor  life  appears  inade¬ 
quate,  because  so  many  possible  mechanisms  may  combine 
in  the  degradation  process.  The  basic  modes  of  failure 
are  change  in  geometry  and  change  in  resistivity.  Re¬ 
sistivity  may  change  due  to  annealing,  defect  decay, 
phase  separation,  and  other  mechanisms.  Geometry  may 
change  due  to  oxidation,  electrolysis,  effects  of  initial 
discontinuities,  and  other  mechanisms. 

Some combination of theoretical and experimental
application  seems  necessary  to  verify  mathematics  of  the 
mechanisms  and  supply  values  for  the  constants  before 
quantitative  information  may  be  obtained.  Carefully 
designed  accelerated  tests,  some  of  which  actually 
separate  the  operative  mechanisms,  may  contribute  the 
required  information.  Some  recent  work  [6]  indicates  that 
stress  relief  in  the  early  part  of  life,  and  internal 
oxidation  and  precipitation  in  the  later  period,  are  the 
dominant  mechanisms. 

Insofar as actual life tests are concerned, the
IRC-type XLT metal film resistor has demonstrated a failure
rate of .0004%/10^3 hr (resistance change less than 0.3
per cent from initial, 60 per cent confidence), and the
IRC-type GEM is currently undergoing qualification under
MIL-R-55182 with an objective of .01%/10^3 hr or better.
Costs  of  these  parts  and  the  MEA-TO,  a  physically  compar¬ 
able  resistor  without  officially  demonstrated  reliability, 
in  10,000-piece  quantity,  are 


IRC Type      Price

XLT           $3.34
GEM           $1.95
MEA-TO        $0.19

Exactly  how  the  cost  divides  into  direct  cost  of 
increased  reliability  and  "reliability  overhead"  (test  and 
documentation  costs)  is  not  known.  Some  inference  can  be 
drawn from the fact that every XLT is x-rayed, the films
are  inspected,  then  microfilmed  and  stored;  an  IBM  card, 
showing  the  complete  history,  accompanies  each  resistor, 
and  it  is  possible  at  any  future  time  to  determine 
material  batches  and  sources,  specific  inspection  and 
test  findings,  and  the  identity  of  every  person  involved 
in  fabrication  of  any  single  resistor! 

3.  TIN  OXIDE  RESISTORS 

These  resistors  are  manufactured  by  chemically  bond¬ 
ing  tin  oxide  into  Pyrex  glass  rods  at  red  heat.  The  rods 
are  spiral-cut  to  final  resistance  value,  and  end-caps, 
leads,  and  protective  coating  are  added. 

Temperature,  either  ambient  or  dissipation-induced, 
is the most significant stress, and diffusion and dissociation
are the dominant degradation mechanisms [6].

Extensive  test  data  on  Corning  styles  N20  and  A100 
tin  oxide  resistors  are  available  [7],  and  the  results 
are  summarized  in  Tables  C-2  and  C-3  below. 

Table C-2

TEST DATA ON TYPE N20 TIN OXIDE RESISTOR (a)

                                       Maximum Percentage Change from
 No. in    Dissipation    Test            Initial           Nominal
 Group     (watts)        (hr)           +      -          +      -

  150         0.1        41,000         0.3    0.3        1.1    0.5
  150         0.2        40,000         0.2    1.0        0.9    0.9
  150         0.3        54,000         0.0    1.1        0.7    1.5
  150         0.6        39,000         1.3    0.3        1.5    0.6

(a) 0.5-watt rated dissipation.

Table C-3

TEST DATA ON TYPE A-100 TIN OXIDE RESISTOR (a)

                                            Maximum Percentage
 No. in    Dissipation    Test              Change from Initial
 Group     (watts)        (hr)                +             -

  199        0.625       [illegible]     [illegible]   [illegible]
  200        0.321       [illegible]     [illegible]       0.5
  200        0.111       [illegible]         1.0           0.5

(a) 0.5-watt rated dissipation.


If ±1.5 per cent resistance change is defined as acceptable
degradation, the above units (with some curiosity
about a possible 600th Type-A unit) show zero failures in
47.1 million unit-hours, for .0035%/1000 hr failure rate
at 60 per cent confidence.
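
A failure rate quoted "at 60 per cent confidence" from a zero-failure test is commonly obtained from the one-sided bound for a constant failure rate, λ = -ln(1 - C)/T, where T is the accumulated unit-hours. The text does not state the method used, so the sketch below is an assumption; the function name is illustrative only.

```python
import math

def zero_failure_rate(unit_hours, confidence):
    """Upper bound on failure rate, in per cent per 1000 hr, for a test
    with zero observed failures (constant-failure-rate assumption)."""
    per_hour = -math.log(1.0 - confidence) / unit_hours
    return per_hour * 1000.0 * 100.0

# Unit-hours of the four N20 groups in Table C-2 (150 units each).
n20_unit_hours = 150 * (41_000 + 40_000 + 54_000 + 39_000)

print(n20_unit_hours)
print(round(zero_failure_rate(n20_unit_hours, 0.60), 4))
print(round(zero_failure_rate(47.1e6, 0.60), 4))
```

Under this assumption, the 26.1 million unit-hours of the N20 groups alone give about .0035%/1000 hr, while the pooled 47.1 million unit-hours would give about .0019%/1000 hr.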

Several  Corning  tin  oxide  types  have  been  qualified 
to  various  reliability  levels.  Results  with  prices  are 
tabulated in Table C-4 below.

Table C-4

RELIABILITY OF CORNING TIN OXIDE RESISTORS

 Corning    Failure Rate     Conf.
 Type       (%/10^3 hr)      Level     [illegible]    Price ($)

 HNR-60     .00057            60            1           (a)
 A51        .0023             60            2           .51
 HRL-07     .015 (b)          60            2           .32
 N20        .0035             60            1           .08
 N60        .0017 (c)         60            1           .10

(a) Not available.
(b) Computed by the author--not stated by Corning.
(c) Corning states similar type exhibited this rate.

Appendix D

CAPACITORS--BEHAVIOR VS. STRESS


1.  DIPPED  MICA  CAPACITORS 


This  is  the  standard  capacitor  of  the  industry  for  the 
range  1  to  10  picofarads.  The  part  is  formed  by  screening 
silver  paste  onto  thin  sheets  of  mica  (naturally  occurring 
aluminum  silicate),  oven-firing,  then  stacking  the  sheets 
with  tin-lead  foil  strips  inserted  at  alternate  ends. 

Clamps  and  leads  are  applied  to  the  ends  of  the  stack,  and 
four coats of epoxy-impregnated phenolic resin are applied.

Voltage  and  temperature  are  the  significant  stresses. 
Failure  occurs  when  a  sufficient  quantity  of  energy  is 
absorbed  within  the  dielectric  to  cause  damage  and  permit 
excessive current flow [1].

Failure  rates  for  dipped  mica  capacitors  have  been 
established  by  accelerated  life  tests  using  high  voltage 
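and temperature simultaneously. The time to failure, t_2,
at a low voltage and temperature (E_2, T_2) is given by the
manufacturer as [2]

$$ t_2 = t_1 \left( \frac{E_1}{E_2} \right)^{n} 2^{(T_1 - T_2)/10} \qquad (1) $$

where t_1 is the time to failure at the test voltage and
temperature (E_1, T_1), and n is the exponent of the voltage
ratio.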


Life  test  data  available  were  obtained  at  500,  750, 
and  1000  volts  and  85°,  125°,  and  150°C.  Extrapolation 
to  (say)  25  volts  and  55°C  would  yield  very  impressive 
reliability  figures.  There  is,  however,  evidence  that  the 
exponent of the voltage ratio is not temperature-independent.
In a study conducted by RCA [3], approximate values of the
exponent versus temperature of the actual test were
as shown in Table D-1.


Table D-1

TEMPERATURE DEPENDENT VOLTAGE EXPONENT FOR
ACCELERATED LIFE TESTS (RCA)

 Test Temperature     Exponent of E_1/E_2 in Eq. (1)

     150°C                      10.6
     125°C                       9.6
      85°C                       6.7

This suggests that, for 55°C operation, the temperature
correction should be applied first to give hypothetical
results of testing at that temperature, then the voltage
ratio correction should be applied, using an extrapolated
value of the exponent. Curvature of the function represented
by the three data points available makes extrapolation
difficult, but a value between zero and two seems reasonable.
The most recent failure data indicate zero failures
for 41.8 x 10^6 unit hours at 85°C and 225 volts.
Extrapolation to 55°C and 25 volts, using a voltage ratio
exponent of 1.0, gives a failure rate of .00008%/10^3 hr
at 90 per cent confidence. This applies to the M2DM-quality
capacitors, which use all-silver internal connections,
thicker mica, a burn-in procedure, and extra-heavy coating.
With the same burn-in, the manufacturer claims a factor
of 4 should be applied to obtain the failure rate for
production capacitors. Although some information indicates
that failure rate is proportional to capacitor size, the
life test data for the above estimates included capacitors
in the range 180 to 10,000 picofarads. Hence, the estimates
may be considered reasonable averages for the range normally
used.
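
The extrapolation described above can be sketched numerically. The sketch assumes that Eq. (1) has the form t_2 = t_1 (E_1/E_2)^n 2^((T_1 - T_2)/10), and that the zero-failure rate at test conditions is the constant-failure-rate bound -ln(1 - C)/T; neither assumption is confirmed by the text, and the function names are illustrative only.

```python
import math

def zero_failure_rate(unit_hours, confidence):
    """Upper bound on failure rate (% per 1000 hr) for zero failures."""
    return -math.log(1.0 - confidence) / unit_hours * 1000.0 * 100.0

def life_ratio(e_test, e_use, t_test_c, t_use_c, exponent):
    """t_use / t_test under the assumed form of Eq. (1)."""
    voltage_term = (e_test / e_use) ** exponent
    temperature_term = 2.0 ** ((t_test_c - t_use_c) / 10.0)
    return voltage_term * temperature_term

# Test conditions from the text: 41.8e6 unit hours, 85 C, 225 volts.
rate_at_test = zero_failure_rate(41.8e6, 0.90)

# Use conditions: 55 C, 25 volts; exponent range suggested in the text.
for n in (0.0, 1.0, 2.0):
    factor = life_ratio(225.0, 25.0, 85.0, 55.0, n)
    print(f"n = {n}: factor {factor:.0f}, rate {rate_at_test / factor:.6f} %/1000 hr")
```

With an exponent of 1.0 the combined factor is 72 and the result is about .00008%/1000 hr, in agreement with the figure quoted. Over the suggested range of zero to two, the estimate varies by almost two orders of magnitude, which shows how strongly the result depends on the extrapolated exponent.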

In another large-scale experience at constant stress
[1], it was found that the voltage ratio exponent increased
with decreasing temperature, as in Table D-2.

Table D-2

TEMPERATURE DEPENDENT VOLTAGE EXPONENT FOR
ACCELERATED LIFE TESTS (ENDICOTT & ZUELLNER)

 Test Temperature     Exponent of E_1/E_2 in Eq. (1)

     125°C                      11.4
     147°C                      10.8
     200°C                       6.0

This  information,  however,  was  obtained  at  2000  to  6000 
volts,  while  the  RCA  tests  applied  500  to  1000  volts,  which 
though  still  far  away,  is  much  nearer  the  intended  use  range. 

The  cost  of  capacitor  reliability  is  roughly  indicated 
by  the  comparison  in  Table  D-3. 

Table D-3

PRICE AND RELIABILITY OF MICA CAPACITORS

 El-Menco Type Mica Capacitor             Price ($)      λ (%/10^3 hr)

 DM-20, 180 pf, 5%, 100V, qty. 500        .03933 (a)       .00032
 M2DM-20, 180 pf, 5%, 100V, qty. 500      .3531            .00008

(a) Note that the DM-20 price should be corrected for burn-in
cost and losses, amounting to perhaps $.03 more per unit.
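
The comparison in Table D-3 can be reduced to two ratios. The sketch below applies the burn-in correction of $.03 suggested in the footnote; the variable names are illustrative only.

```python
# Figures from Table D-3 and its footnote.
dm20_price = 0.03933 + 0.03    # production part plus burn-in allowance ($)
m2dm_price = 0.3531            # high-reliability part ($)
dm20_rate = 0.00032            # %/1000 hr
m2dm_rate = 0.00008            # %/1000 hr

price_ratio = m2dm_price / dm20_price
rate_ratio = dm20_rate / m2dm_rate

print(f"price ratio: {price_ratio:.2f}")
print(f"failure-rate ratio: {rate_ratio:.1f}")
```

On these figures the high-reliability part costs about five times as much as the burned-in production part for a fourfold reduction in failure rate.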


2.  GLASS  CAPACITORS 

Similar  in  appearance  to  molded  mica  capacitors,  glass 
capacitors  are  made  by  alternating  layers  of  glass  dielec¬ 
tric  and  conductive  material,  then  fusing  the  entire  assembly 
into  a  monolithic  glass  block  with  hermetic  end  seals. 

Significant  stresses  and  failure  modes  are  the  same 
as  for  the  mica  capacitor.  Failure  mechanisms  include  Joule 
heating  due  to  thermionic  emission,  or  quantum  tunneling,  de¬ 
pending  on  ionic  concentration  in  the  dielectric  [4].  Yet 
another  anomaly  in  the  step-stress  technique  is  reported  by 
Best et al. [4], wherein capacitors subjected to 30-minute
steps  of  100  volts  failed  at  4500  volts,  while  units  ex¬ 
periencing  5-hour  steps  of  50  volts  failed  at  6000  volts. 

The  Corning  Type  CYFR  high-reliability  capacitor  has 
a failure rate of .0003%/1000 hours at 1/4 x rated voltage
and  55°C,  as  nearly  as  can  be  estimated  from  the  confusing 
presentation  of  life  data  [5].  This  item  clearly  carries 
the  usual  "reliability  overhead"  as  identically  manufactured 
items are available under two designations differing only
in the amount of documentation. Table D-4 gives quoted
prices.


Table D-4

PRICE OF GLASS CAPACITORS

 Corning-Type Glass Capacitor               Price ($)

 CYFR, 180 pfd, J951 spec., 1000 qty.         1.57
 CYFR, 180 pfd, J950 spec., 1000 qty.         1.20
 CYFM, 180 pfd, 5000 qty.                     0.76
 TY06, 180 pfd, 1000 qty.                     0.48

Note  that  the  "standard  line"  TY06  glass  capacitor  costs 
more  than  the  "high-rel"  M2DM  mica  capacitor.  For  a  temper¬ 
ature-  and  humidity-controlled  ground  environment,  the  glass 
capacitor  appears  inefficient  from  the  economic  standpoint. 

3.  PAPER  AND  ELECTROLYTIC  CAPACITORS 

Actual life tests of El-Menco mylar-paper dipped
capacitors show one failure in 14.3 x 10^6 unit hours at
105°C and rated voltage. Extrapolation to 55°C and 25
volts (for a 200-volt capacitor), using the correction
formula for mica capacitors, gives an estimated failure
rate of .001%/10^3 hr at 90 per cent confidence. The
relatively low usage of paper capacitors in typical systems
probably makes this value sufficiently low to ensure
negligible contribution to the reliability computation.

The even lower usage of electrolytic capacitors, and
the availability of computer-grade units (e.g., Sangamo
type 500) having excellent stability characteristics when
temperatures are moderate, make detailed consideration
unnecessary. Sangamo states that the type 500 "will provide
satisfactory service for ten years or longer" [6]. Degradation
data through 5000 hours at 85°C are available. It
should be noted, however, that deterioration of the electrolyte
is a true wearout mechanism in non-solid electrolytic
capacitors [7].

Appendix E

A  COMPENDIUM  OF  FAILURE  STATISTICS 

Table  E-l  summarizes  the  failure  rate  information 
which  has  been  collected  from  a  large  variety  of  sources 
during  the  course  of  this  study.  An  effort  has  been 
made  to  eliminate  duplication  of  information,  but  some 
of  the  "summary"  type  entries  may  still  include  the 
individual  test  entries.  Discussion  of  the  origins  of 
unusually high failure rates is given in Chap. II.


Table E-1

[Table body not recovered. One surviving entry: Ge transistor, 14,060, .012 (average of 50 entries).]